Releases · EAddario/llama.cpp
b6123
cuda: refactored ssm_scan and use CUB (#13291)
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* use cub load and store warp transpose
* suppress clang warning
b6121
gguf-py : add Numpy MXFP4 de/quantization support (#15111)
* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
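For orientation, MXFP4 packs blocks of 32 FP4 (E2M1) values that share a single power-of-two E8M0 scale. The NumPy sketch below only illustrates the dequantization idea; the function name, block handling, and bit layout here are assumptions for illustration, not gguf-py's actual code or storage format.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; the sign lives in the fourth bit.
# This is the standard OCP MXFP4 value set, used here purely for illustration.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_mxfp4_block(nibbles: np.ndarray, e8m0_scale: int) -> np.ndarray:
    """Dequantize one block of 32 FP4 codes sharing an E8M0 scale (illustrative only).

    nibbles    : uint8 array of 32 codes, each 0..15 (bit 3 = sign, bits 0..2 = magnitude index)
    e8m0_scale : biased exponent byte; the shared scale is 2**(e8m0_scale - 127)
    """
    sign = np.where(nibbles & 0x8, -1.0, 1.0).astype(np.float32)
    magnitude = E2M1_VALUES[nibbles & 0x7]
    scale = np.float32(2.0) ** (np.int32(e8m0_scale) - 127)
    return sign * magnitude * scale

# Example: a block whose shared scale is 2**(127 - 127) = 1
codes = (np.arange(32, dtype=np.uint8) % 16)
print(dequantize_mxfp4_block(codes, 127))
```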
b6109
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (#15131)
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
b6096
llama : add gpt-oss (#15091)
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
  * llama : add attn sinks
  * ggml : add attn sinks
  * cuda : add attn sinks
  * vulkan : add support for sinks in softmax, remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
  * ggml : add fused swiglu_oai op
  * update ggml/src/ggml-cpu/ops.cpp
  * update CUDA impl
  * cont : metal impl
  * add vulkan impl
  * test-backend-ops : more test cases, clean up
  * llama : remove unfused impl
  * remove extra lines
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
  * ggml : use e8m0 conversion instead of powf
  * change kvalues_mxfp4 table to match e2m1 (#6)
  * metal : remove quantization for now (not used)
  * cuda : fix disabled CUDA graphs due to ffn moe bias
  * vulkan : add support for mxfp4
  * cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
  * ggml : add ggml_add_id
  * add cuda impl
  * llama : add weight support check for add_id
  * perf opt
  * add vulkan impl
  * rename cuda files
  * add metal impl
  * allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
* cleanup
* sycl : fix supports_op for MXFP4
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
* fix hip build
* cuda : add mxfp4 dequantization support for cuBLAS
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

Co-authored-by: Georgi Gerganov
Co-authored-by: slaren
Co-authored-by: Diego Devesa
Co-authored-by: Xuan Son Nguyen
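The "attention sinks" items above refer to a per-head learnable logit that takes part in the softmax normalization so that the regular attention weights can sum to less than one. The NumPy sketch below is a toy illustration of that idea only; it is not ggml's kernel, and the name `sink_logit` is made up for the example.

```python
import numpy as np

def softmax_with_sink(scores: np.ndarray, sink_logit: float) -> np.ndarray:
    """Toy illustration of an attention sink for one head/query.

    scores     : (n_tokens,) raw attention logits
    sink_logit : learnable per-head scalar that absorbs attention mass

    The sink joins the softmax normalization but its own weight is discarded,
    so the returned weights sum to <= 1.
    """
    logits = np.concatenate([scores, [sink_logit]])
    logits -= logits.max()            # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights[:-1]               # drop the sink's share

scores = np.array([0.2, 1.5, -0.3])
w = softmax_with_sink(scores, sink_logit=2.0)
print(w, w.sum())                     # the weights sum to less than 1
```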
b6082
vulkan: fix build when using glslang that does not support coopmat2 (…
b6039
opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (#14809)
b6037
server : add support for `embd_normalize` parameter (#14964)

This commit adds support for the `embd_normalize` parameter in the server code.

The motivation for this is that currently, if the server is started with a pooling type other than `none`, Euclidean/L2 normalization is always applied to the embeddings. This is not always the desired behavior; users may want a different normalization, or none at all, and this commit allows that.

Example usage:
```console
curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello world today", "embd_normalize": -1}'
```
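As a rough sketch of what the parameter controls, the two behaviors named in the note above could look like the snippet below. Only `-1` (no normalization, as in the curl example) and `2` (Euclidean/L2, described as the default) are taken from the note; everything else, including the function name, is an illustrative assumption, not the server's actual code.

```python
import numpy as np

def normalize_embedding(embd: np.ndarray, embd_normalize: int) -> np.ndarray:
    """Illustrative sketch of the embedding normalization modes mentioned above.

    -1 : return the raw embedding unchanged
     2 : Euclidean/L2 normalization (the default behavior described in the note)
    """
    if embd_normalize == -1:
        return embd
    if embd_normalize == 2:
        norm = np.linalg.norm(embd)
        return embd if norm == 0.0 else embd / norm
    raise ValueError("mode not covered by this sketch")

vec = np.array([3.0, 4.0])
print(normalize_embedding(vec, -1))  # [3. 4.]
print(normalize_embedding(vec, 2))   # [0.6 0.8]
```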
b6020
CUDA: add roll (#14919)
* CUDA: add roll
* Make everything const, use __restrict__
b6005
vulkan: add ops docs (#14900)
b5996
CANN: Implement GLU ops (#14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
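For context, REGLU, GEGLU, and SWIGLU are gated linear unit variants: the input is split into a gate half and a value half, the gate is passed through ReLU, GELU, or SiLU respectively, and the result is multiplied elementwise with the value half. The NumPy sketch below illustrates the math only, assuming a simple split along the last dimension; it is not the CANN implementation, and the split convention is an assumption for the example.

```python
import numpy as np

def _relu(x):
    return np.maximum(x, 0.0)

def _gelu(x):
    # tanh approximation of GELU, used only to keep the sketch NumPy-only
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def _silu(x):
    return x / (1.0 + np.exp(-x))

def glu_variant(x: np.ndarray, activation) -> np.ndarray:
    """Split the last dimension into a gate half and a value half,
    activate the gate, and multiply elementwise (GLU family)."""
    gate, value = np.split(x, 2, axis=-1)
    return activation(gate) * value

x = np.random.randn(4, 8).astype(np.float32)
reglu  = glu_variant(x, _relu)   # REGLU
geglu  = glu_variant(x, _gelu)   # GEGLU
swiglu = glu_variant(x, _silu)   # SWIGLU
```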