Releases · Mozilla-Ocho/llamafile

01 Dec 01:00

jart

0.8.17

c88f2d3

llamafile v0.8.17 Latest

Latest

llamafiler has a new web UI which supports two modes of operation:
chatbot and raw completion. Its syntax highlighting is just as advanced
as the CLI chatbot. It looks much nicer than the old web ui. In a future
release, llamafiler will be folded into llamafile to replace the old server.

988c9ec Introduce raw completions web ui
241bf21 Introduce /v1/completions endpoint in new server
6d89f8f Add binary safety check to server
d18ddf1 Add redo button to new web ui
bc82424 Add settings modal to web ui
bb917bd Add vision model support to new server
4c7b7d5 Implement data URI parser
fb4b3e6 Fix JSON parser bug
9d6f89f Improve look and printability of new web ui
25b6910 Make chatbot ui more printer friendly
30518ca Respond to HTTP OPTIONS requests
41abfa3 Work around multiple image handling
35bc088 Make default system prompt configurable on web
28c8e22 Scale and decimate images as needed in browser
14713b5 Get basic chatbot web gui working in llamafiler
ef08074 Start porting syntax highlighter to JavaScript
fdfdb13 Port remaining highlighting code to javascript

The following improvements have been made to our terminal chatbot.

12c3761 Make CLI chatbot work better with base models
e5c0921 Improve VT100 support
4b61791 Fix VT102 support
d25c077 Introduce /upload and /forget commands to chatbot
880ebc7 Handle empty system prompt better in cli chatbot

General improvements to this project.

f581c40 Fix futex prototype
54d3c72 Make LLaVA fast again
01b8d49 Remove n-gpu-layer limitation (#534)
566cdc1 Improve Gemma system prompt generation
46284fe Reduce attack surface of stb_image
9bb262b Log CUDA kernel vs. runtime versions

Syntax highlighting improvements for chatbot and web ui.

d979a1c Add BNF syntax highlighting
4a8311a Add cmake syntax highlighting
40e92cf Add Ocaml syntax highlighting
0995343 Add more Clojure keywords
0068a37 Make D syntax highlighting better
0965a4b Make some markdown improvements
9b96502 Improve JS/HTML syntax highlighting
c0622da Put more work into markdown rendering
fa1c98f Improve markdown to html rendering
8915432 Further improve markdown to html
d25fa3a Improve highlighting in new web ui
f5a0bd4 Fix JS regex highlighting issue
2807ae6 Improve Ada syntax highlighting
d30da30 Syntax highlight D properly
33a057e Improve Ruby some more
5b0fff1 Improve Ruby syntax highlighting
8413a21 Fix Ruby builtins in web gui

The latest cosmopolitan upgrade introduces a new more powerful syntax
for your .args files. They're now parsed more similarly to the shell,
with support for C style escaping in double-quoted strings. You can also
now add shell-style comments to .args files too. See tool/args/args2.c
in the cosmopolitan codebase for the definitive reference.

fb59488 Upgrade to Cosmo v3.9.7
21af0bf Import upstream bestline changes

The following example of the new .args file syntax is provided:

# specify model
-m Qwen2.5-Coder-34B-Instruct.Q6_K.gguf

# prevent flags below from being changed
...

# specify system prompt
--system-prompt "\
you are a friendly ai assistant\n
your job is to be helpful and intelligent"

# hide some stuff from user interfaces
--nologo
--no-display-prompt

You can put .args files inside llamafile, llamafiler, and whisperfile
using the zipalign program.

The following screenshots are provided of the llamafiler web ui.

Assets 9

02 Nov 03:42

jart

0.8.16

011d720

llamafile v0.8.16

Add Julia syntax highlighting support
Fix possible crash on Windows due to MT bug
Improve accuracy of chatbot context window management
The new llamafiler server now supports GPU. Pass the -ngl 999 flag.
The new llamafiler server's /v1/chat/completions endpoint now supports prompt caching. It may be configured using the --slots COUNT and --ctx-size TOKENS flags.

Assets 10

30 Oct 21:13

jart

0.8.15

3d71282

llamafile v0.8.15

The --chat bot interface now supports syntax highlighting 42 separate programming languages: ada, asm, basic, c, c#, c++, cobol, css, d, forth, fortran, go, haskell, html, java, javascript, json, kotlin, ld, lisp, lua, m4, make, markdown, matlab, pascal, perl, php, python, r, ruby, rust, scala, shell, sql, swift, tcl, tex, txt, typescript, and zig.

That chatbot now supports more commands:

/undo may be used to have the LLM forget the last thing you said. This is useful when you get a poor response and want to try asking your question a different way, without needing to start the conversation over from scratch.
/push and /pop works similarly, in the sense that it allows you to rewind a conversation to a previous state. In this case, it does so by creating save points within your context window. Additionally, /stack may be used to view the current stack.
/clear may be used to reset the context window to the system prompt, effectively starting your conversation over.
/manual may be used to put the chat interface in "manual mode" which lets you (1) inject system prompts, and (2) speak as the LLM. This could be useful in cases where you want the LLM to believe it said something when it actually didn't.
/dump may be used to print out the raw conversation history, including special tokens (that may be model specific). You can also say /dump filename.txt to save the raw conversation to a file.

We identified an issue with Google's Gemma models, where the chatbot wasn't actually inserting the system prompt. That's now fixed. So you can now instruct Gemma to do roleplaying if you pass the flags llamafile -m gemma.gguf -p "you are role playing as foo" --chat.

You can now type CTRL-J to create multi-line prompts in the terminal chatbot. It works similarly to shift-enter in the browser. It can be a quicker alternative to using the chatbot's triple quote syntax, i.e. """multi-line / message""".

Bugs in the new chatbot have been fixed. For example, we now do a better job making sure special tokens like BOS, EOS, and EOT get inserted when appropriate into the conversation history. This should improve fidelity when using the terminal chatbot interface.

The --threads and --threads-batch flags may now be used separately to tune how many threads are used for prediction and prefill.

The llamafile-bench command now supports benchmarking GPU support (see #581 from @cjpais)

Both servers now support configuring a URL prefix, thanks to (see #597 and #604 from @vlasky)

Support for the IQ quantization formats is being removed from our CUDA module to save on build times. If you want to use IQ quants with your NVIDIA hardware, you need to pass the --iq --recompile flags to llamafile once, to build a ggml-cuda module for your system that includes them.

Finally, we have an alpha release of a new /v1/chat/completions endpoint for the new llamafiler server. We're planning to build a new web interface that's based on this soon, so you're encouraged to test this, since llamafiler will eventually replace the old server too. File an issue if there's any features you need.

Contributors

cjpais and vlasky

Assets 10

14 Oct 01:56

jart

0.8.14

a28d1d5

llamafile v0.8.14

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of llama.cpp and cosmopolitan libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.

v0.8.14 changes

This release introduces our new CLI chatbot interface. It supports
multi-line input using triple quotes. It will syntax highlight Python,
C, C++, Java, and JavaScript code.

This chatbot is now the default mode of operation. When you launch
llamafile without any special arguments, the chatbot will be launched
in the foreground, and the server will be launched in the background.
You can use the --chat and --server flags to disambiguate this
behavior if you only want one of them.

a384fd7 Create ollama inspired cli chatbot
63205ee Add syntax highlighting to chatbot
7b395be Introduce new --chat flag for chatbot
28e98b6 Show prompt loading progress in chatbot
4199dae Make chat+server hybrid the new default mode

The whisperfile server now lets you upload mp3/ogg/flac.

74dfd21 Rewrite audio file loader code
7517a5f whisperfile server: convert files without ffmpeg (#568)

Other improvements have been made.

d617c0b Added vision support to api_like_OAI (#524)
726f6e8 Enable gpu support in llamafile-bench (#581)
c7c4d65 Speed up KV in llamafile-bench
2c940da Make replace_all() have linear complexity
fa4c4e7 Use bf16 kv cache when it's faster
20fe696 Upgrade to Cosmopolitan 3.9.4
c44664b Always favor fp16 arithmetic in tinyBLAS
98eff09 Quantize TriLM models using Q2_K_S (#552)

Assets 7

18 Aug 17:22

jart

0.8.13

b17ccd1

llamafile v0.8.13

llamafile lets you distribute and run LLMs with a single file

v0.8.13 changes

This release synchronizes with upstream projects, bringing with it
support for the newest models (e.g. Gemma 2B). Support for LLaMA v3 has
been significantly improved.

e9ee3f9 Synchronize with llama.cpp upstream
d0b5e8f Upgrade to Cosmopolitan v3.7.1

The new llamafiler server is now able to serve 2400 embeddings per
second on CPU. That's 3x faster than the llama.cpp server upstream. It's
now hardened for security. You should be able to safely use it a public
facing web server. There's a man page for llamafiler. You can also read
the docs online: /llamafile/server/doc/index.md.

070aa13 Bring new server up to 2421 embedding/sec
584a327 Increase tokens per second on tiny models
99dd1c0 Add seccomp, tokenbucket, and batch prioritization
cda83f8 Make GGML threads spawn 10x faster
d451e0e Add chrome://tracing/ feature

The new llamafiler server now fully supports all the old embedding
endpoints that were provided by llamafile --server. Support for
serving embeddings has been removed from the old server.

be94c1f Add OpenAI /v1/embeddings to new llamafiler server

This release introduces whisperfile which is a single-file
implementation of OpenAI's Whisper model. It lets you transcribe speech
to text and even translate it too. Our implementation is based off
Georgi Gerganov's whisper.cpp project.
The project to turn it into a whisperfile was
founded by CJ Pais who's handed over maintenance of his awesome work.
There's a man page for whisperfile (which also can be viewed by running
./whisperfile --help) and we have online documentation with markdown
tutorials at /whisper.cpp/doc/index.md.

fd891be Merge whisperfile into llamafile (#517)
7450034 Use colorblind friendly TTY colors in whisperfile
ggerganov/whisper.cpp#2360 (our fork is upstreaming changes)

We developed a faster, more accurate implementation of GeLU. This helps
improve the performance of tiny models. It leads to measurable quality
improvements in whisper model output.

8ace604 Write explicitly vectorized GeLU functions
b5748f3 Implement the real trick to GeLU with proof
ggerganov/llama.cpp#8878 (our fork is upstreaming changes)

We've been improving floating point numerical stability for very large
models, e.g. Mixtral 8x22b and Command-R-Plus. tinyBLAS on CPU for F32,
F16, and BF16 weights now uses a new zero-overhead divide-and-conquer
approach to computing dot products, which we call ruler reduction, that
can result in a 10x reduction in worst case roundoff error accumulation.

cb817f5 Reduce rounding errors for very large models
5b06924 Use ruler reduction for GGML dot products

This release introduces sdfile, which is our implementation of stable
diffusion. No documentation is yet provided for this command, other than
the docs provided by the upstream stable-diffusion.cpp
project on which it's based.

3b7b1e3 Add stable-diffusion.cpp
25ceb2c Upgrade stable diffusion

The list of new architectures and tokenizers introduced by this version are:
Open ELM, GPT NEOX, Arctic, DeepSeek2, ChatGLM, BitNet, T5, JAIS, Poro,
Viking, Tekken, and CodeShell.

Known Issues

The llamafile executable size is increased from 30mb to 200mb by this release.
This is caused by ggerganov/llama.cpp#7156. We're already employing some
workarounds to minimize the impact of upstream development contributions
on binary size, and we're aiming to find more in the near future.

Assets 7

28 Jul 04:08

jart

0.8.12

a73ea13

llamafile v0.8.12

1839bfa Introduce --no-warmup flag
f40facc Upgrade to Cosmopolitan v3.6.2
fdd5d84 Fix build determinism issue
3e220e7 Update llamafile version in README
909f791 Make zipalign and slicehf gentler on system
dd10455 Add link to OLMo-7B in README
867c752 Fix code compatibility issues

Assets 4

23 Jul 18:13

jart

0.8.11

109e926

llamafile v0.8.11

7469a23 Add smaug-bpe tokenizer

Assets 4

23 Jul 17:53

jart

0.8.10

f7c6ef4

llamafile v0.8.10

llamafile lets you distribute and run LLMs with a single file

This release includes a build of the new llamafile server rewrite we've
been promising, which we're calling llamafiler. It's matured enough to
recommend for embedding serving. This is the fastest way to serve
embeddings. If you use it with all-MiniLM-L6-v2.Q6_K.gguf then on
Threadripper it can serve JSON /embedding at 800 req/sec whereas the old
llama.cpp server could only do 100 req/sec. So you can fill up your RAG
databases very quickly if you productionize this.

The old llama.cpp server came from a folder named "examples" and was
never intended to be production worthy. This server is designed to be
sturdy and uncrashable. It has /completion and /tokenize endpoints too,
which serves 3.7 million requests per second on Threadripper, thanks to
Cosmo Libc improvements.

See the LLaMAfiler Documentation for further details.

73b1836 Write documentation for new server
b3930aa Make GGML asynchronously cancelable
8604e9a Fix POSIX undefined cancelation behavior
323f50a Let SIGQUIT produce per-thread backtraces
15d7fba Use semaphore to limit GGML worker threads
d7c8e33 Add support for JSON parameters to new server
7f099cd Make stack overflows recoverable in new server
fb3421c Add barebones /completion endpoint to new server

This release restores support for non-AVX x86 microprocessors. We had to
drop support at the beginning of the year. However our CPUid dispatching
has advanced considerably since then. We're now able to offer top speeds
on modern hardware, without leaving old hardware behind.

a674cfb Restore support for non-AVX microprocessors
555fb80 Improve build configuration

Here's the remaining improvements included in this release:

cc30400 Supports SmolLM (#495)
4a4c065 Fix CUDA compile warnings and errors
82f845c Avoid crashing with BF16 on Apple Metal

Assets 4

01 Jul 19:11

jart

0.8.9

cd84736

llamafile v0.8.9

This release gets Gemma2 working closer to how Google intended.

af22695 Make gemma2-27b-it the same as aistudio.google.com
41678c8 Add sliding window mask for Gemma2
140eed5 Add soft-capping to Gemma2

This release fixes Android support. You can now run LLMs on your phone
using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net)
for bug reports and and testing efforts. See also other bug fixes described
by the Cosmopolitan v3.5.4 and v3.5.3 release notes.

Our future replacement for the server now has an /embedding endpoint. On
my workstation, it's currently able to serve 851 requests per second for
a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings
model. None of the requests fail and 99th percentile latency is 56.74ms.

1346ef4 Create /embedding endpoint in new server
263d39b Use float to string conversion
0d62d05 Reclaim llama_decode() memory on cancelation
617d841 Remove ggml_context cache
46dda4f Refactor new server and get leak checker working
cd73243 Prevent vector overflow in llama.cpp

You can try the new embedding server as follows:

make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl http://127.0.0.1:8080/embedding?prompt=orange

Compatibility with the old server's API of posting JSON content will be
added in upcoming changes. The same goes for the OpenAI API. The goal's
to be compatible with everything.