
ggml : add ggml_set_rows #14274


Open · wants to merge 18 commits into base: master

Conversation

rgerganov (Collaborator)

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

ref: #8366

@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Jun 19, 2025
@ggerganov mentioned this pull request on Jun 19, 2025
@ggerganov (Member)

So far so good: #14285

I think ggml_set_rows() alone could be a very useful addition, since this mechanism allows llama_kv_cache_unified::find_slot() to search not just for contiguous slots of KV cells, but effectively to "scatter" the ubatch. This would be a useful improvement regardless of whether the graph reuse works, so I think we should proceed to implement this operator.

ggml/src/ggml.c Outdated
Comment on lines 3400 to 3421
struct ggml_tensor * ggml_set_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        struct ggml_tensor  * c) {
    GGML_ASSERT(b->ne[2] == c->ne[1]);
    GGML_ASSERT(c->ne[3] == 1);
    GGML_ASSERT(a->type == GGML_TYPE_F16);
    GGML_ASSERT(b->type == GGML_TYPE_F32);
    GGML_ASSERT(c->type == GGML_TYPE_I64);
(Member)

We might want to allow broadcasting c into b. It would avoid this ggml_repeat_4d here:

v_cur = ggml_cont_3d(ctx, v_cur, 1, v_cur->ne[0], v_cur->ne[1]);
kv_idxs = ggml_repeat_4d(ctx, kv_idxs, v_cur->ne[1], v_cur->ne[2], 1, 1);
return ggml_set_rows(ctx, v_view, v_cur, kv_idxs);

@ggerganov (Member)

I think we want to support broadcasting like this:

    // a TD  [n_embd, ne01,   ne01_2, ne01_3]
    // b TS  [n_embd, n_rows, ne01_2, ne01_3]
    // c I64 [n_rows, ne21,   ne22,   1]
    //
    // broadcast:
    //   ne01_2 % ne21 == 0
    //   ne01_3 % ne22 == 0
    GGML_API struct ggml_tensor * ggml_set_rows(
            struct ggml_context * ctx,
            struct ggml_tensor  * a,  // destination
            struct ggml_tensor  * b,  // source
            struct ggml_tensor  * c); // row indices

Will try to implement this and open a PR to this branch.

@ggerganov (Member)

> Will try to implement this and open a PR to this branch.

Opened rgerganov#3

@github-actions bot added the "testing", "examples" and "Apple Metal" labels on Jun 23, 2025
@JohannesGaessler (Collaborator)

Question: why do we need a new ggml op GGML_SET_ROWS? Wouldn't it be possible to create a tensor with GGML_GET_ROWS where, instead of allocating a new tensor, the output tensor is a view of a?

@ggerganov (Member)

We want to be able to set rows at arbitrary positions, not necessarily contiguously. For example, we might want to set rows 2, 5 and 13. I don't see how this can be achieved with GGML_GET_ROWS.

@ggerganov ggerganov marked this pull request as ready for review June 26, 2025 07:30
@ggerganov (Member)

This operator seems quite useful to me, and after the recent experiments in #14285 and #14363 I think we should proceed with adding it. I suggest merging this PR for now and continuing to add the rest of the backend implementations towards master.

@ggerganov ggerganov requested a review from slaren June 26, 2025 07:35
Comment on lines +691 to +693
// true if the elements in dimension 0 are contiguous, or there is just 1 block of elements
GGML_API bool ggml_is_contiguous_rows(const struct ggml_tensor * tensor);

(Member)

Attention here

@@ -192,6 +192,7 @@ typedef pthread_t ggml_thread_t;

static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
[GGML_TYPE_F32] = {
.from_float = (ggml_from_float_t) ggml_cpu_fp32_to_fp32,
(Member)

Attention here

Comment on lines 2269 to 2273
static void ggml_compute_forward_repeat_i64(
const ggml_compute_params * params,
ggml_tensor * dst) {

const ggml_tensor * src0 = dst->src[0];
(Member)

After adding the broadcast support to ggml_set_rows() this is not really needed anymore, but I think it's nice to have either way.

(Member)

It would be nice to use a template instead of duplicating the code, though. We need to start porting this code to C++ somewhere.

(Member)

OK, will add a template in a follow-up PR. For now, I removed the i64 support and added a TODO.

@rgerganov (Collaborator, Author)

It looks like we are hitting actions/runner-images#12435 when building on Windows:

FAILED: [code=1] ggml/src/CMakeFiles/ggml-base.dir/Release/ggml.cpp.obj 
ccache C:\PROGRA~1\LLVM\bin\CLANG_~1.EXE --target=arm64-pc-windows-msvc -DGGML_BUILD -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -Dggml_base_EXPORTS -DCMAKE_INTDIR=\"Release\" -ID:/a/llama.cpp/llama.cpp/ggml/src/. -ID:/a/llama.cpp/llama.cpp/ggml/src/../include -march=armv8.7-a -fvectorize -ffp-model=fast -fno-finite-math-only -Wno-format -Wno-unused-variable -Wno-unused-function -Wno-gnu-zero-variadic-macro-arguments -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=gnu++17 -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT ggml/src/CMakeFiles/ggml-base.dir/Release/ggml.cpp.obj -MF ggml\src\CMakeFiles\ggml-base.dir\Release\ggml.cpp.obj.d -o ggml/src/CMakeFiles/ggml-base.dir/Release/ggml.cpp.obj -c D:/a/llama.cpp/llama.cpp/ggml/src/ggml.cpp
In file included from D:/a/llama.cpp/llama.cpp/ggml/src/ggml.cpp:1:
In file included from D:/a/llama.cpp/llama.cpp/ggml/src\ggml-impl.h:475:
In file included from C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\vector:8:
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\yvals_core.h:908:1: error: static assertion failed: error STL1000: Unexpected compiler version, expected Clang 19.0.0 or newer.
  908 | _EMIT_STL_ERROR(STL1000, "Unexpected compiler version, expected Clang 19.0.0 or newer.");
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\yvals_core.h:534:58: note: expanded from macro '_EMIT_STL_ERROR'
  534 | #define _EMIT_STL_ERROR(NUMBER, MESSAGE)   static_assert(false, "error " #NUMBER ": " MESSAGE)
      |                                                          ^~~~~
1 error generated.

@JohannesGaessler
Copy link
Collaborator

Looking at the current CUDA code for copies/type conversions, it's kind of a mess. So I'm thinking it would make sense for me to consolidate the code with a template covering copies, type conversions, and optionally indices for GET_ROWS and SET_ROWS. I'm noticing that GET_ROWS uses 32-bit indices while SET_ROWS uses 64-bit indices. Is this an intentional choice?

@ggerganov (Member)

> I'm noticing that GET_ROWS uses 32-bit indices while SET_ROWS uses 64-bit indices. Is this an intentional choice?

Yes, for SET_ROWS it is needed because we will update rows in the KV cache, and the row count can become n_embd*n_head*n_kv*n_seq (for example here).

I think technically GET_ROWS should also start using 64-bit indices, but this is not a priority at the moment as we haven't observed use cases that need it. We can change this later on.
