
ggml : add ggml_set_rows #14274


Open · wants to merge 18 commits into base: master

Conversation

rgerganov (Collaborator)

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

ref: #8366

@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Jun 19, 2025
@ggerganov mentioned this pull request on Jun 19, 2025
@ggerganov (Member)

So far so good: #14285

I think ggml_set_rows() alone could be a very useful addition, since this mechanism allows llama_kv_cache_unified::find_slot() to search not just for contiguous slots of KV cells, but effectively to "scatter" the ubatch. This would be a useful improvement regardless of whether the graph reuse works, so I think we should proceed to implement this operator.

ggml/src/ggml.c Outdated
Comment on lines 3400 to 3421
struct ggml_tensor * ggml_set_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        struct ggml_tensor  * c) {
    GGML_ASSERT(b->ne[2] == c->ne[1]);
    GGML_ASSERT(c->ne[3] == 1);
    GGML_ASSERT(a->type == GGML_TYPE_F16);
    GGML_ASSERT(b->type == GGML_TYPE_F32);
    GGML_ASSERT(c->type == GGML_TYPE_I64);
(Member)

We might want to allow broadcasting c into b. It would avoid this ggml_repeat_4d here:

v_cur = ggml_cont_3d(ctx, v_cur, 1, v_cur->ne[0], v_cur->ne[1]);
kv_idxs = ggml_repeat_4d(ctx, kv_idxs, v_cur->ne[1], v_cur->ne[2], 1, 1);
return ggml_set_rows(ctx, v_view, v_cur, kv_idxs);

@ggerganov (Member)

I think we want to support broadcasting like this:

    // a TD  [n_embd, ne01,   ne01_2, ne01_3]
    // b TS  [n_embd, n_rows, ne01_2, ne01_3]
    // c I64 [n_rows, ne21,   ne22,   1]
    //
    // broadcast:
    //   ne01_2 % ne21 == 0
    //   ne01_3 % ne22 == 0
    GGML_API struct ggml_tensor * ggml_set_rows(
            struct ggml_context * ctx,
            struct ggml_tensor  * a,  // destination
            struct ggml_tensor  * b,  // source
            struct ggml_tensor  * c); // row indices

Will try to implement this and open a PR to this branch.

@ggerganov (Member)

> Will try to implement this and open a PR to this branch.

Opened rgerganov#3

@github-actions bot added the "testing", "examples" and "Apple Metal" labels on Jun 23, 2025
@JohannesGaessler (Collaborator)

Question: why do we need a new ggml op GGML_SET_ROWS? Wouldn't it be possible to create a tensor with GGML_GET_ROWS where, instead of allocating a new tensor, the output tensor is a view of a?

@ggerganov (Member)

We want to be able to set rows at arbitrary positions, not necessarily contiguously. For example, we might want to set rows 2, 5 and 13. I don't see how this can be achieved with GGML_GET_ROWS.

@ggerganov ggerganov marked this pull request as ready for review June 26, 2025 07:30
@ggerganov (Member)

This operator seems quite useful to me, and after the recent experiments in #14285 and #14363 I think we should proceed with adding it. I suggest merging this PR for now and continuing to add the rest of the backend implementations towards master.

@ggerganov ggerganov requested a review from slaren June 26, 2025 07:35
Comment on lines +691 to +693
// true if the elements in dimension 0 are contiguous, or there is just 1 block of elements
GGML_API bool ggml_is_contiguous_rows(const struct ggml_tensor * tensor);

(Member)

Attention here

@@ -192,6 +192,7 @@ typedef pthread_t ggml_thread_t;

static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
[GGML_TYPE_F32] = {
.from_float = (ggml_from_float_t) ggml_cpu_fp32_to_fp32,
(Member)

Attention here

Comment on lines 2269 to 2273
static void ggml_compute_forward_repeat_i64(
const ggml_compute_params * params,
ggml_tensor * dst) {

const ggml_tensor * src0 = dst->src[0];
(Member)

After adding the broadcast support to ggml_set_rows() this is not really needed anymore, but I think it's nice to have either way.

(Member)

It would be nice to use a template instead of duplicating the code, though. We need to start porting this code to C++ somewhere.

(Member)

OK, will add a template in a follow-up PR. For now, I removed the i64 support and added a TODO.

@rgerganov (Collaborator, Author)

It looks like we are hitting actions/runner-images#12435 when building on Windows:

FAILED: [code=1] ggml/src/CMakeFiles/ggml-base.dir/Release/ggml.cpp.obj 
ccache C:\PROGRA~1\LLVM\bin\CLANG_~1.EXE --target=arm64-pc-windows-msvc -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -DCMAKE_INTDIR=\"Release\" -ID:/a/llama.cpp/llama.cpp/ggml/src/. -ID:/a/llama.cpp/llama.cpp/ggml/src/../include -march=armv8.7-a -fvectorize -ffp-model=fast -fno-finite-math-only -Wno-format -Wno-unused-variable -Wno-unused-function -Wno-gnu-zero-variadic-macro-arguments -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=gnu++17 -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT ggml/src/CMakeFiles/ggml-base.dir/Release/ggml.cpp.obj -MF ggml\src\CMakeFiles\ggml-base.dir\Release\ggml.cpp.obj.d -o ggml/src/CMakeFiles/ggml-base.dir/Release/ggml.cpp.obj -c D:/a/llama.cpp/llama.cpp/ggml/src/ggml.cpp
In file included from D:/a/llama.cpp/llama.cpp/ggml/src/ggml.cpp:1:
In file included from D:/a/llama.cpp/llama.cpp/ggml/src\ggml-impl.h:475:
In file included from C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\vector:8:
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\yvals_core.h:908:1: error: static assertion failed: error STL1000: Unexpected compiler version, expected Clang 19.0.0 or newer.
  908 | _EMIT_STL_ERROR(STL1000, "Unexpected compiler version, expected Clang 19.0.0 or newer.");
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\yvals_core.h:534:58: note: expanded from macro '_EMIT_STL_ERROR'
  534 | #define _EMIT_STL_ERROR(NUMBER, MESSAGE)   static_assert(false, "error " #NUMBER ": " MESSAGE)
      |                                                          ^~~~~
1 error generated.

@JohannesGaessler
Copy link
Collaborator

Looking at the current CUDA code for copies/type conversions, it's kind of a mess. So I'm thinking it would make sense for me to consolidate the code with a template covering copies, type conversions, and optionally indices for GET_ROWS and SET_ROWS. I'm noticing that GET_ROWS uses 32-bit indices while SET_ROWS uses 64-bit indices. Is this an intentional choice?

@ggerganov (Member)

> I'm noticing that GET_ROWS uses 32-bit indices while SET_ROWS uses 64-bit indices. Is this an intentional choice?

Yes, for SET_ROWS it is needed because we will update rows in the KV cache, and the row count can become n_embd*n_head*n_kv*n_seq (for example here).

I think technically GET_ROWS should also start using 64-bit indices, but this is not a priority at the moment as we haven't observed use cases that need it. We can change this later on.
