ubatch : new splitting logic #14217

Open · wants to merge 17 commits into master from gg/ubatch-rework

Conversation

ggerganov (Member) commented Jun 16, 2025

  • Remove llama_sbatch
  • llama_batch_allocr now handles ubatch splitting
  • llama_batch_allocr precomputes various index maps and guarantees the inputs are consistent
  • llama_ubatch can now iterate over unique sequence ids
  • Change the notion of llama_ubatch.n_seqs from "number of sequences" to "number of sequence sets" (illustrated in the sketch after this list)
  • Enable pooling for n_tokens <= seq_id. Remove padding hack from llama-server
  • Detailed batch debug output
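
To illustrate the new notion of n_seqs, a toy sketch (the struct below is a simplified stand-in, not the real llama_ubatch):

#include <cstdint>
#include <cstdio>
#include <vector>

// simplified stand-in for a ubatch's per-set sequence-id lists
struct seq_set_t {
    std::vector<int32_t> seq_ids;
};

int main() {
    // a ubatch with tokens shared by sequences {0, 1} plus tokens owned by
    // sequence {2}: three sequence ids, but two sequence *sets*
    const std::vector<seq_set_t> sets = { { {0, 1} }, { {2} } };

    const uint32_t n_seqs = (uint32_t) sets.size(); // == 2 under the new notion

    for (uint32_t s = 0; s < n_seqs; ++s) {
        printf("seq set %u:", s);
        for (const int32_t id : sets[s].seq_ids) {
            printf(" %d", id);
        }
        printf("\n");
    }
}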

TODO:

  • Fix this:
    make -j && LLAMA_BATCH_DEBUG=2 ./bin/llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q8_0 --image ~/Downloads/rects.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 1 -ngl 99 --temp 0.0 -c 20000 -b 1

ggerganov marked this pull request as ready for review June 17, 2025 12:36
ggerganov requested a review from ngxson as a code owner June 17, 2025 12:36
ggerganov requested a review from compilade June 17, 2025 12:36
compilade (Collaborator) commented Jun 17, 2025

This breaks shuffled batches for equal splits.

When running test-model-random (from #14139) with this PR, I get:

Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
init: sequence 0 does not start from the last position stored in the memory
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true
/path/to/llama.cpp/tests/test-model-random.cpp:841: GGML_ASSERT(out) failed

But there's also something else which did not happen before:

Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Error for seq_id 3 is 0.008624 at n_past=525
Error for seq_id 4 is 0.005619 at n_past=487
Error for seq_id 4 is 0.133501 at n_past=590
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: (40%) FAILED
Error for seq_id 3 is 0.008624 at n_past=525
Error for seq_id 4 is 0.005619 at n_past=487
Error for seq_id 4 is 0.133501 at n_past=590
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: (40%) FAILED
Error for seq_id 3 is 0.008624 at n_past=525
Error for seq_id 4 is 0.005619 at n_past=487
Error for seq_id 4 is 0.133501 at n_past=590
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: (40%) FAILED

Seems like multiple sequences with chunked SWA have some inconsistency.

ggerganov (Member, Author) commented:

> But there's also something else which did not happen before:

Does it trigger consistently? It passes on my end:

$ git show
commit c0df4490c4d6e04ec8e2421fdba2655cbc3d5b44 (HEAD -> gg/ubatch-rework)
Merge: cc7952b42 04b8f5143
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 17 18:44:41 2025 +0300
    Merge remote-tracking branch 'origin/compilade/test-model-random' into gg/ubatch-rework
$ git diff
diff --git a/tests/test-model-random.cpp b/tests/test-model-random.cpp
index 218cfcb82..b5c1d7248 100644
--- a/tests/test-model-random.cpp
+++ b/tests/test-model-random.cpp
@@ -1004,7 +1004,7 @@ int main(int argc, char ** argv) {
                     llama_free(ref_ctx);
                 }
 
-                for (bool shuffle : { false, true }) {
+                for (bool shuffle : { false, }) {
 
                     // skip shuffling the batch for non-recurrent models
                     // (simple splits don't handle shuffled batches correctly)
$ a=$(make -j > /dev/null) && ./bin/test-model-random
..............
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK
.............................
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK
................
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK
............
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK

compilade (Collaborator) commented Jun 17, 2025

> Does it trigger consistently?

It does on a Pixel 9 Pro in Termux. But it seems like this might not be a regression from this PR, since it also happens on #14139 (sorry, I didn't test that branch on this hardware before).

-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+sve+nosme

Reproducing commands:

$ git switch compilade/test-model-random
$ mkdir build
$ cd build
$ cmake .. --fresh
$ make -j6 test-model-random
$ ./bin/test-model-random

So it's not a problem caused by this PR, sorry for misreporting.

(The shuffled-batch regression, however, is.)

ggerganov (Member, Author) commented:

Yes, I can reproduce it on my Mac as well, when I disable Metal or force ngl = 0. So it's very likely a bug in one of the CPU kernels.

ggerganov (Member, Author) commented Jun 17, 2025

My best guess is that the summation here overflows FP16:

#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));

    GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };

    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];

    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);

            sum[j] = GGML_F16_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }

    // reduce sum0..sum3 to sum0
    GGML_F16_VEC_REDUCE(sumf, sum);
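
For intuition, here is a standalone sketch of the failure mode (it only emulates FP16 precision by rounding a float's mantissa to 10 bits, ignoring FP16's narrower exponent range; it is not the ggml code above, and it demonstrates precision loss rather than a true overflow past 65504):

#include <cstdint>
#include <cstdio>
#include <cstring>

// round a float to FP16-like precision by keeping 10 mantissa bits
static float round_to_fp16_precision(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    bits = (bits + (1u << 12)) & ~((1u << 13) - 1); // round half up, drop 13 bits
    float y;
    memcpy(&y, &bits, sizeof(y));
    return y;
}

int main() {
    const int   n = 1 << 16;
    const float v = 0.05f;

    float  sum16 = 0.0f; // accumulator limited to FP16-like precision
    double sum32 = 0.0;  // wide accumulator, like ggml_float in the leftovers loop

    for (int i = 0; i < n; ++i) {
        sum16 = round_to_fp16_precision(sum16 + v);
        sum32 += v;
    }

    printf("fp16-like accumulation: %f\n", sum16); // stalls at 128.0
    printf("wide accumulation:      %f\n", sum32); // ~3276.8
}

The narrow accumulator stops changing once half of the FP16 spacing exceeds the addend, so a long dot product silently loses mass well before any value approaches FP16's maximum.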

Applying this patch to make the accumulation use F32 (via the leftovers loop) fixes the issue:

diff --git a/ggml/src/ggml-cpu/vec.cpp b/ggml/src/ggml-cpu/vec.cpp
index f7614568e..03044d382 100644
--- a/ggml/src/ggml-cpu/vec.cpp
+++ b/ggml/src/ggml-cpu/vec.cpp
@@ -198,7 +198,7 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
     ggml_float sumf = 0.0;
 
 #if defined(GGML_SIMD)
-    const int np = (n & ~(GGML_F16_STEP - 1));
+    const int np = 0;
 
     GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };
 

The best fix for now is probably to set the KV cache types that the test uses to F32; this also works.
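
For reference, a minimal sketch of what that could look like (type_k and type_v are existing fields of llama_context_params, defaulting to F16; exactly where the test would set them is an assumption):

// force an F32 KV cache so the reference comparison is not affected by
// F16 accumulation error; `model` is assumed to be loaded elsewhere
llama_context_params cparams = llama_context_default_params();

cparams.type_k = GGML_TYPE_F32;
cparams.type_v = GGML_TYPE_F32;

llama_context * ctx = llama_init_from_model(model, cparams);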

> This breaks shuffled batches for equal splits.

I'll take a look at whether this can be handled cleanly, but I'm wondering if this use case is really needed. Do you have any specific applications in mind that require shuffled positions in the input batch?

compilade (Collaborator) commented:

> The best fix for now is probably to set the KV cache types that the test uses to F32; this also works.

That's very likely what I'll end up doing, thanks. (although it's less representative of actual use)

> Do you have any specific applications in mind that require shuffled positions in the input batch?

The main benefit is that it makes it really easy to test that sequence aggregation works correctly for proper splitting. If it works with shuffled batches, then it can work with pretty much anything.

For an actual use case, I'm not really sure.

I'll see how the test can be changed to not affect the relative order within the sequences but still shuffle the relative order of tokens of different sequences. This makes the test a bit harder to implement, though it would be more representative of the expected possible batch orderings (and should probably make the test viable for simple splits too).
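
One shape such a shuffle could take (a sketch of the idea, not the test's actual code): randomly interleave the sequences' token streams while keeping each stream's internal order.

#include <cstdint>
#include <random>
#include <vector>

// randomly interleave per-sequence token streams, preserving the relative
// order of tokens within each sequence
static std::vector<int32_t> interleave_sequences(
        const std::vector<std::vector<int32_t>> & seqs, std::mt19937 & rng) {
    std::vector<size_t> next(seqs.size(), 0); // next unconsumed token per sequence
    std::vector<size_t> active;               // sequences with tokens left

    for (size_t s = 0; s < seqs.size(); ++s) {
        if (!seqs[s].empty()) {
            active.push_back(s);
        }
    }

    std::vector<int32_t> out;

    while (!active.empty()) {
        std::uniform_int_distribution<size_t> pick(0, active.size() - 1);

        const size_t k = pick(rng);
        const size_t s = active[k];

        out.push_back(seqs[s][next[s]++]);

        if (next[s] == seqs[s].size()) {
            active.erase(active.begin() + k); // sequence s is exhausted
        }
    }

    return out;
}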

ggerganov (Member, Author) commented:

> I'll see how the test can be changed to not affect the relative order within the sequences but still shuffle the relative order of tokens of different sequences.

Ok, that would be useful. Regarding the fully shuffled batches, I will add checks for such inputs and raise an error.
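
Such a check might look roughly like this (a sketch assuming one seq_id per token for simplicity; the real validation would live in llama_batch_allocr and would also handle multi-sequence tokens):

#include <cstdint>
#include <vector>

// reject batches where a sequence's tokens appear with non-increasing
// positions, i.e. the batch was shuffled within a sequence
static bool batch_pos_increasing(const std::vector<int32_t> & seq_id,
                                 const std::vector<int32_t> & pos,
                                 uint32_t n_seq_max) {
    std::vector<int32_t> last(n_seq_max, -1);

    for (size_t i = 0; i < seq_id.size(); ++i) {
        if (pos[i] <= last[seq_id[i]]) {
            return false; // out-of-order token for this sequence
        }
        last[seq_id[i]] = pos[i];
    }

    return true;
}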

Review thread on the following diff lines:

    return ubatch_add(idxs, idxs.size(), false);
}

llama_ubatch llama_batch_allocr::split_equal(uint32_t n_ubatch) {
compilade (Collaborator) commented Jun 18, 2025

Note that for equal splits, some sequence sets are not compatible (i.e. they can't be put in the same ubatch). For example, a sequence set containing multiple seq_ids cannot be mixed with one having a seq_id in the multi-sequence set.

For example, tokens with seq_ids = { 0, 1, 2, 3 } are not compatible with tokens in seq_ids = { 1 }.

The reason is that the recurrent states are only copied to the target sequences on ubatch boundaries, and so dependent tokens cannot be mixed with a shared trunk.

Is this handled here?

Basically the main constraint to check would be that the sequence sets in a ubatch are independent (at least, I think that would be sufficient?).

(Before this PR, it was handled by splitting multi-sequence token groups on their own before the single-sequence tokens)

(I did not implement multi-sequence tests yet in #14139, but that should also be able to answer this question once implemented)

ggerganov (Member, Author) commented:

> For example, a sequence set containing multiple seq_ids cannot be mixed with one having a seq_id in the multi-sequence set.

Yes, this logic here at the beginning of the function determines the unique non-overlapping sequence sets that will be contained in this ubatch:

// determine the non-overlapping sequence sets participating in this ubatch
for (int32_t i = 0; i < batch.n_tokens; ++i) {
    if (used[i]) {
        continue;
    }

    bool add = true;

    for (uint32_t s = 0; s < cur_seq_set.size(); ++s) {
        // no overlap with existing sequence sets:
        if (!(cur_seq_set[s] & seq_set[i]).none()) {
            add = false;

            break;
        }
    }

    if (add) {
        cur_seq_set.push_back(seq_set[i]);

        if (cur_seq_set.size() > n_ubatch) {
            break;
        }
    }
}

const uint32_t n_seqs = cur_seq_set.size();
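
For intuition, the overlap test in isolation, applied to the { 0, 1, 2, 3 } vs { 1 } example above (the bitset width here is arbitrary; llama.cpp sizes its sequence sets by the maximum number of sequences):

#include <bitset>
#include <cassert>

int main() {
    std::bitset<64> multi;  // tokens shared by sequences {0, 1, 2, 3}
    std::bitset<64> single; // tokens owned by sequence {1}

    for (int id : {0, 1, 2, 3}) {
        multi.set(id);
    }
    single.set(1);

    // the sets overlap, so these tokens cannot share an equal-split ubatch
    assert(!(multi & single).none());
}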

ggerganov (Member, Author) commented:

@compilade FYI, the tentative plan is to first merge #13979 and after that merge this PR (unless you spot more issues). ETA probably tomorrow.
