Releases: Mozilla-Ocho/llamafile
llamafile v0.8.13
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of llama.cpp and cosmopolitan libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
v0.8.13 changes
This release synchronizes with upstream projects, bringing with it
support for the newest models (e.g. Gemma 2B). Support for LLaMA v3 has
been significantly improved.
The new llamafiler server is now able to serve 2400 embeddings per
second on CPU. That's 3x faster than the llama.cpp server upstream. It's
now hardened for security. You should be able to safely use it as a
public-facing web server. There's a man page for llamafiler. You can also read
the docs online: /llamafile/server/doc/index.md.
- 070aa13 Bring new server up to 2421 embedding/sec
- 584a327 Increase tokens per second on tiny models
- 99dd1c0 Add seccomp, tokenbucket, and batch prioritization
- cda83f8 Make GGML threads spawn 10x faster
- d451e0e Add chrome://tracing/ feature
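Commit 99dd1c0 mentions a token bucket. As a rough illustration of that general rate-limiting technique (this is not llamafiler's actual code; the class name and the rate and capacity values are hypothetical), a minimal Python sketch could look like this:

```python
import time


class TokenBucket:
    """Generic token bucket: bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True and spend one token if the request may proceed."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server would typically keep one such bucket per client address and reject or deprioritize requests when `allow()` returns False. The optional `now` argument exists only so the behavior can be checked without sleeping.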
The new llamafiler server now fully supports all the old embedding
endpoints that were provided by llamafile --server. Support for
serving embeddings has been removed from the old server.
- be94c1f Add OpenAI /v1/embeddings to new llamafiler server
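For readers who want to call the /v1/embeddings endpoint from a script, here is a hedged Python sketch. It assumes the server listens on 127.0.0.1:8080 and accepts the request and response shapes of OpenAI's embeddings API; the model name "default" and both helper function names are placeholders, not something these notes specify.

```python
import json
import urllib.request


def build_embeddings_request(texts, base_url="http://127.0.0.1:8080", model="default"):
    """Build (but do not send) an OpenAI-style POST to /v1/embeddings."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response_json):
    """Pull the embedding vectors out of an OpenAI-style response body."""
    return [item["embedding"] for item in response_json["data"]]


# Sending the request needs a running server, so it is left commented out:
#   with urllib.request.urlopen(build_embeddings_request(["orange"])) as r:
#       vectors = extract_vectors(json.load(r))
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real server.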
This release introduces whisperfile which is a single-file
implementation of OpenAI's Whisper model. It lets you transcribe speech
to text and even translate it too. Our implementation is based off
Georgi Gerganov's whisper.cpp project.
The project to turn it into a whisperfile was
founded by CJ Pais who's handed over maintenance of his awesome work.
There's a man page for whisperfile (which also can be viewed by running
./whisperfile --help) and we have online documentation with markdown
tutorials at /whisper.cpp/doc/index.md.
- fd891be Merge whisperfile into llamafile (#517)
- 7450034 Use colorblind friendly TTY colors in whisperfile
- ggerganov/whisper.cpp#2360 (our fork is upstreaming changes)
We developed a faster, more accurate implementation of GeLU. This helps
improve the performance of tiny models. It leads to measurable quality
improvements in whisper model output.
- 8ace604 Write explicitly vectorized GeLU functions
- b5748f3 Implement the real trick to GeLU with proof
- ggerganov/llama.cpp#8878 (our fork is upstreaming changes)
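To give a feel for why GeLU accuracy can matter, the Python sketch below compares the erf-based definition of GeLU with the widely used tanh approximation. It is not the vectorized implementation from 8ace604 or b5748f3, and these notes do not say which formula was previously in use; the sketch only measures how far the common approximation strays from the definition.

```python
import math


def gelu_exact(x):
    """GeLU defined via the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh-based approximation of GeLU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest gap between the two on a grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh approx| on [-6, 6]: {max_gap:.2e}")
```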
We've been improving floating point numerical stability for very large
models, e.g. Mixtral 8x22b and Command-R-Plus. tinyBLAS on CPU for F32,
F16, and BF16 weights now uses a new zero-overhead divide-and-conquer
approach to computing dot products, which we call ruler reduction, that
can result in a 10x reduction in worst case roundoff error accumulation.
- cb817f5 Reduce rounding errors for very large models
- 5b06924 Use ruler reduction for GGML dot products
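The Python sketch below shows one way a divide-and-conquer reduction can be arranged and why it tends to accumulate less roundoff than left-to-right summation. It is an assumption about the general idea, not the tinyBLAS kernel: partial sums are merged according to the trailing zero bits of the running index (the "ruler" sequence), and float32 is emulated with `struct`. The helper names are illustrative.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum(values):
    """Left-to-right accumulation in float32."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def ruler_sum(values):
    """Divide-and-conquer accumulation in float32 using a stack of partial sums.

    After the i-th element, partial sums are merged once per trailing zero
    bit of i (0, 1, 0, 2, 0, 1, 0, 3, ...), so equal-sized blocks are added
    together as in a balanced binary tree.
    """
    stack = []
    for i, v in enumerate(values, start=1):
        acc = f32(v)
        k = i
        while k % 2 == 0:
            acc = f32(stack.pop() + acc)
            k //= 2
        stack.append(acc)
    total = 0.0
    while stack:  # fold whatever blocks remain
        total = f32(stack.pop() + total)
    return total


values = [0.1] * 65536
exact = f32(0.1) * 65536  # representable exactly in double precision
naive_err = abs(naive_sum(values) - exact)
ruler_err = abs(ruler_sum(values) - exact)
print(f"naive error: {naive_err:.6f}, ruler error: {ruler_err:.6f}")
```

The stack never holds more than about log2(n) partial sums, which is why this kind of scheme can be cheap enough to use inside a dot-product loop.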
This release introduces sdfile, which is our implementation of stable
diffusion. No documentation is yet provided for this command, other than
the docs provided by the upstream stable-diffusion.cpp
project on which it's based.
The new architectures and tokenizers introduced by this version are:
Open ELM, GPT NEOX, Arctic, DeepSeek2, ChatGLM, BitNet, T5, JAIS, Poro,
Viking, Tekken, and CodeShell.
Known Issues
This release increases the llamafile executable size from 30 MB to 200 MB.
This is caused by ggerganov/llama.cpp#7156. We're already employing some
workarounds to minimize the impact of upstream development contributions
on binary size, and we're aiming to find more in the near future.
llamafile v0.8.12
llamafile v0.8.11
- 7469a23 Add smaug-bpe tokenizer
llamafile v0.8.10
This release includes a build of the new llamafile server rewrite we've
been promising, which we're calling llamafiler. It's matured enough to
recommend for embedding serving. This is the fastest way to serve
embeddings. If you use it with all-MiniLM-L6-v2.Q6_K.gguf then on
Threadripper it can serve JSON /embedding at 800 req/sec whereas the old
llama.cpp server could only do 100 req/sec. So you can fill up your RAG
databases very quickly if you productionize this.
The old llama.cpp server came from a folder named "examples" and was
never intended to be production worthy. This server is designed to be
sturdy and uncrashable. It has /completion and /tokenize endpoints too;
the latter serves 3.7 million requests per second on Threadripper, thanks to
Cosmo Libc improvements.
See the LLaMAfiler Documentation for further details.
- 73b1836 Write documentation for new server
- b3930aa Make GGML asynchronously cancelable
- 8604e9a Fix POSIX undefined cancelation behavior
- 323f50a Let SIGQUIT produce per-thread backtraces
- 15d7fba Use semaphore to limit GGML worker threads
- d7c8e33 Add support for JSON parameters to new server
- 7f099cd Make stack overflows recoverable in new server
- fb3421c Add barebones /completion endpoint to new server
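Commit 15d7fba describes using a semaphore to limit worker threads. The general pattern can be sketched in Python as follows; this is a generic illustration rather than the server's C++ code, and `MAX_WORKERS`, `job`, and the sleep standing in for real work are all hypothetical.

```python
import threading
import time

MAX_WORKERS = 2  # hypothetical cap on concurrently running jobs
slots = threading.Semaphore(MAX_WORKERS)
lock = threading.Lock()
active = 0
peak = 0


def job(_):
    global active, peak
    with slots:  # blocks while MAX_WORKERS jobs are already running
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for real work
        with lock:
            active -= 1


threads = [threading.Thread(target=job, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent jobs:", peak)
```

Eight threads are started, but the semaphore ensures that no more than `MAX_WORKERS` of them are inside the guarded region at any moment.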
This release restores support for non-AVX x86 microprocessors. We had to
drop support at the beginning of the year. However our CPUid dispatching
has advanced considerably since then. We're now able to offer top speeds
on modern hardware, without leaving old hardware behind.
Here's the remaining improvements included in this release:
llamafile v0.8.9
This release gets Gemma2 working closer to how Google intended.
- af22695 Make gemma2-27b-it the same as aistudio.google.com
- 41678c8 Add sliding window mask for Gemma2
- 140eed5 Add soft-capping to Gemma2
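For readers unfamiliar with the two Gemma2 features named above, here is a small Python sketch. It assumes the usual definitions: soft-capping as cap * tanh(x / cap), and a sliding window mask as causal attention restricted to the most recent positions. The cap and window values are example numbers only, and none of this is taken from the commits themselves.

```python
import math


def soft_cap(x, cap):
    """Squash x smoothly into the range [-cap, cap] using cap * tanh(x / cap)."""
    return cap * math.tanh(x / cap)


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Attention is causal (j <= i) and limited to the last `window` positions.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


print(soft_cap(100.0, 30.0))
for row in sliding_window_mask(5, 3):
    print("".join("x" if ok else "." for ok in row))
```

Small inputs pass through soft-capping almost unchanged, while large ones saturate near the cap instead of growing without bound.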
This release fixes Android support. You can now run LLMs on your phone
using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net)
for bug reports and testing efforts. See also other bug fixes described
by the Cosmopolitan v3.5.4 and v3.5.3 release notes.
Our future replacement for the server now has an /embedding endpoint. On
my workstation, it's currently able to serve 851 requests per second for
a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings
model. None of the requests fail and 99th percentile latency is 56.74ms.
- 1346ef4 Create /embedding endpoint in new server
- 263d39b Use float to string conversion
- 0d62d05 Reclaim llama_decode() memory on cancelation
- 617d841 Remove ggml_context cache
- 46dda4f Refactor new server and get leak checker working
- cd73243 Prevent vector overflow in llama.cpp
You can try the new embedding server as follows:
make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl 'http://127.0.0.1:8080/embedding?prompt=orange'
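Prompts containing spaces or characters such as & need to be percent-encoded before they go into the query string. A minimal Python sketch of building the same URL safely is shown below; the helper name is hypothetical, and the address is the one used in the curl command above.

```python
from urllib.parse import quote, urlencode


def embedding_url(prompt, base_url="http://127.0.0.1:8080"):
    """Build the GET URL for /embedding with the prompt percent-encoded."""
    return base_url + "/embedding?" + urlencode({"prompt": prompt}, quote_via=quote)


print(embedding_url("orange"))
print(embedding_url("blood orange & lime"))
```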
Compatibility with the old server's API of posting JSON content will be
added in upcoming changes. The same goes for the OpenAI API. The goal's
to be compatible with everything.
llamafile v0.8.8
llamafile v0.8.7
This release includes important performance enhancements for quants.
- 293a528 Performance improvements on Arm for legacy and k-quants (#453)
- c38feb4 Optimized matrix multiplications for i-quants on __aarch64__ (#464)
This release fixes bugs. For example, we're now using a brand new memory
manager, which is believed to support platforms like Android that have a
virtual address space with fewer than 47 bits. This release also restores our
prebuilt Windows AMD GPU support, thanks to tinyBLAS.
- 0c0e72a Upgrade to Cosmopolitan v3.5.1
- 629e208 Fix server crash due to /dev/urandom
- 60404a8 Always use tinyBLAS with AMD GPUs on Windows
- 6d3590c Pacify --temp flag when running in server mode
- a28250b Update GGML_HIP_UMA (#473)
- e973fa2 Improve CPU brand detection
- 9cd8d70 Update server README build/testing instructions (#461)
It should be noted that, in future releases, we plan to introduce a new
server for llamafile. This new server is being designed for performance
and production-worthiness. It's not included in this release, since the
new server currently only supports a tokenization endpoint. However the
endpoint is capable of doing 2 million requests per second whereas with
the current server, the most we've ever seen is a few thousand.
- e0656ea Introduce new llamafile server
llamafile v0.8.6
Two minor issues are fixed with this release.
- 69c2dd3 Don't print special tokens for now (improve shell scriptability)
- 866a129 Upgrade to Cosmopolitan v3.3.8
See the llamafile v0.8.5 release notes for further details. For driver-only prebuilt AMD GPU support on Windows, please use llamafile v0.8.4 for the next few weeks, until ggerganov/llama.cpp#7156 is resolved.
llamafile v0.8.5
This release fixes bugs and introduces @Kawrakow's latest quant
performance enhancements (a feature exclusive to llamafile). As of #435
the K quants now go consistently 2x faster than llama.cpp upstream. On
big CPUs like Threadripper we've doubled the performance of tiny models,
for both prompt processing and token generation (see the benchmarks
below). The llamafile-bench and llamafile-upgrade-engine commands have
been introduced.
- a86e7ce Add Script To Upgrade llamafile Archives (#412)
- 07e87bf 261dfe7 Fix llamafile-quantize and rewrite documentation
- 938cf72 Faster AVX2 matrix multiplications for MoE models (#428)
- eaa756d Faster AVX2 matrix multiplications for legacy quants (#405)
- 7cb15c6 Another performance optimization for Zen4 + refactoring (#435)
- 9206719 8b2f8d8 e675719 4451c6d Introduce llamafile-bench command (cpu mode only)
- 87d4ce1 Fix f16 cpuid check (caused crashes on sandybridge)
- 5c40565 8d1afe4 Avoid crashing on llava ctrl-c
- c0aa43e Introduce bf16 cuda support
- 00e4f72 Enable GGML_CUDA_FORCE_MMQ in tinyBLAS mode
- d228e01 0b5997d 64fbffc Sync with llama.cpp upstream (#427)
- c660d38 Add text embedding models to 'other example llamafiles' table (#422)
- 49cc13c Updated README with instructions to load models from third-party apps (#417)
Note: Please use llamafile v0.8.4 if you need prebuilt (driver-only) AMD GPU support on Windows,
at least for the next few weeks, until ggerganov/llama.cpp#7156 is resolved.
Binaries run on Linux, Windows, MacOS, FreeBSD, OpenBSD, and NetBSD for
AMD64 and ARM64. Supported GPUs are CUDA, ROCm, and Metal. Prebuilt GPU
binaries are provided for CUDA/ROCm on Linux, and CUDA on Windows. To
install this release on systems with a POSIX-style shell:
sudo -s
cd /usr/local
wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.5/llamafile-0.8.5.zip
unzip llamafile-0.8.5.zip
exit
llamafile --help
To upgrade your old llamafiles without needing to redownload, run:
llamafile-upgrade-engine old.llamafile new.llamafile
Prebuilt llamafiles that have the LLM weights included are available at:
- https://huggingface.co/Mozilla (official)
- https://huggingface.co/models?library=llamafile (community)
Here are some tutorials:
- https://justine.lol/oneliners/
- https://github.com/mozilla-ocho/llamafile/
- https://future.mozilla.org/news/llamafiles-for-embeddings-in-local-rag-applications/
- https://blog.mozilla.ai/local-llm-as-judge-evaluation-with-lm-buddy-prometheus-and-llamafile/
- https://www.docker.com/blog/a-quick-guide-to-containerizing-llamafile-with-docker-for-ai-applications/
Here are some performance benchmarks for various quantization formats, on the world's flagship CPUs. See https://justine.lol/matmul/ to compare these numbers to where we were two months ago, back in March.
cpu_info | model_filename | size | test | t/s |
---|---|---|---|---|
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.BF16 | 86.99 GiB | pp512 | 447.01 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.BF16 | 86.99 GiB | tg16 | 11.35 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.F16 | 86.99 GiB | pp512 | 340.63 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.F16 | 86.99 GiB | tg16 | 11.01 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q8_0 | 46.22 GiB | pp512 | 288.16 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q8_0 | 46.22 GiB | tg16 | 15.82 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q6_K | 35.74 GiB | pp512 | 431.51 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q6_K | 35.74 GiB | tg16 | 22.73 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q5_K_M | 30.95 GiB | pp512 | 427.71 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q5_K_M | 30.95 GiB | tg16 | 24.90 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_K_M | 26.49 GiB | pp512 | 440.03 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_K_M | 26.49 GiB | tg16 | 27.31 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_0 | 24.63 GiB | pp512 | 287.51 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_0 | 24.63 GiB | tg16 | 18.92 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_M | 21.00 GiB | pp512 | 433.89 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_M | 21.00 GiB | tg16 | 30.30 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_S | 19.03 GiB | pp512 | 432.36 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_S | 19.03 GiB | tg16 | 31.34 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q2_K | 16.12 GiB | pp512 | 449.64 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q2_K | 16.12 GiB | tg16 | 33.71 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F32 | 4.10 GiB | pp512 | 2103.25 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F32 | 4.10 GiB | tg16 | 57.34 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.BF16 | 2.05 GiB | pp512 | 2603.84 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.BF16 | 2.05 GiB | tg16 | 77.18 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 2038.64 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 80.23 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 2203.77 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 100.78 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 2838.05 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 135.27 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_1 | 791.50 MiB | pp512 | 2328.06 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_1 | 791.50 MiB | tg16 | 138.15 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 2676.14 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 143.58 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 2281.44 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 145.02 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_S | 729.84 MiB | pp512 | 2757.59 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_S | 729.84 MiB | tg16 | 143.59 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_1 | 668.18 MiB | pp512 | 2444.11 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_1 | 668.18 MiB | tg16 | 148.50 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_M | 636.18 MiB | pp512 | 2758.90 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_M | 636.18 MiB | tg16 | 149.92 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_S | 609.53 MiB | pp512 | 2847.95 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_S | 609.53 MiB | tg16 | 150.84 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 2420.58 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 154.27 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_L | 563.42 MiB | pp512 | 2743.74 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_L | 563.42 MiB | tg16 | 155.29 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_M | 522.30 MiB... |
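As a rough guide to reading the t/s column, the Python sketch below turns two rates from the table into an end-to-end time estimate. It assumes pp512 means processing a 512-token prompt and tg16 means generating 16 tokens, and it ignores model loading and any other overhead; the function name is hypothetical.

```python
def estimated_seconds(prompt_tokens, generated_tokens, pp_tps, tg_tps):
    """Rough wall-clock estimate from prompt-processing and generation rates."""
    return prompt_tokens / pp_tps + generated_tokens / tg_tps


# Rates from the mixtral-8x7b-instruct-v0.1.Q4_K_M rows of the table.
pp_tps, tg_tps = 440.03, 27.31
t = estimated_seconds(512, 16, pp_tps, tg_tps)
print(f"512-token prompt + 16 generated tokens: about {t:.2f} s")
```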
llamafile v0.8.4
This release fixes underflows and overflows.
- A memory bug in the grammar parser has been fixed. It caused commands like ./llamafile -m foo.gguf -p bar --grammar 'root::="' (which fail to specify a closing quote) to crash. Anyone using the server as a public-facing endpoint (despite our previous recommendations) is strongly encouraged to upgrade. See 22aba95 and 3fe045f. Credit for discovering (and most importantly, reporting) this issue goes to Eclypsium Security Researcher Richard Johnson. We incorrectly reported earlier that this fix was incorporated into the v0.8.2 release. You need to use the v0.8.4 release. This bug fix was upstreamed in ggerganov/llama.cpp#7194.
- Our new vectorized expf() implementation now handles underflow by producing subnormals rather than flushing to zero. b5c6df6
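The difference between producing subnormals and flushing to zero can be seen with a small Python sketch. It emulates float32 rounding with `struct` and is only an illustration of the two behaviors, not the vectorized expf() from b5c6df6; the function names are hypothetical.

```python
import math
import struct

FLT_MIN = 1.17549435e-38  # smallest normal float32


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def exp_gradual(x):
    """exp rounded to float32, keeping subnormal results."""
    return f32(math.exp(x))


def exp_flush_to_zero(x):
    """Same, but any result below the normal range is replaced by 0."""
    y = f32(math.exp(x))
    return y if y >= FLT_MIN else 0.0


for x in (-80.0, -90.0, -100.0):
    print(x, exp_gradual(x), exp_flush_to_zero(x))
```

With gradual underflow, very negative inputs still yield tiny nonzero values, so downstream code such as a softmax denominator does not see those terms vanish abruptly.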
See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)