ubatch : new splitting logic #14217
base: master
Conversation
This breaks shuffled batches for equal splits. When running
But there's also something else which did not happen before:
Seems like multiple sequences with chunked SWA have some inconsistency.
Does it trigger consistently? It passes on my end:
It does on a Pixel 9 Pro in Termux. But it seems like this might not be a regression from here since it also happens in #14139 (sorry, I didn't test that branch on this hardware before).
Reproducing commands:

```shell
$ git switch compilade/test-model-random
$ mkdir build
$ cd build
$ cmake .. --fresh
$ make -j6 test-model-random
$ ./bin/test-model-random
```

So it's not a problem caused by this PR, sorry for misreporting. (The shuffled batch regression, however, is.)
Yes, I reproduce it on my Mac also when I disable Metal, or force
My best guess is that the summation here overflows FP16 (`ggml/src/ggml-cpu/vec.cpp`, lines 199 to 219 at commit 860a9e4).
Applying this patch to make the accumulation use F32 (via the leftovers loop) fixes the issue:

```diff
diff --git a/ggml/src/ggml-cpu/vec.cpp b/ggml/src/ggml-cpu/vec.cpp
index f7614568e..03044d382 100644
--- a/ggml/src/ggml-cpu/vec.cpp
+++ b/ggml/src/ggml-cpu/vec.cpp
@@ -198,7 +198,7 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
     ggml_float sumf = 0.0;

 #if defined(GGML_SIMD)
-    const int np = (n & ~(GGML_F16_STEP - 1));
+    const int np = 0;

     GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };
```
The best fix for now is probably to set the KV cache types that the test uses to F32 - this also works.
I'll take a look if this can be handled cleanly, but I'm wondering if this use case is really needed. Do you have any specific applications in mind that require shuffled positions in the input batch?
That's very likely what I'll end up doing, thanks. (Although it's less representative of actual use.)
The main benefit is that it makes it really easy to test that sequence aggregation works correctly for proper splitting. If it works with shuffled batches, then it can work with pretty much anything. For an actual use case, I'm not really sure. I'll see how the test can be changed to not affect the relative order within the sequences but still shuffle the relative order of tokens of different sequences. This makes the test a bit harder to implement, though it would be more representative of the expected possible batch orderings (and should probably make the test viable for simple splits too).
Ok, that would be useful. Regarding the fully shuffled batches, I will add checks for such inputs and raise an error.
```cpp
    return ubatch_add(idxs, idxs.size(), false);
}

llama_ubatch llama_batch_allocr::split_equal(uint32_t n_ubatch) {
```
Note that for equal splits, some sequence sets are not compatible (i.e. they can't be put in the same ubatch). For example, a sequence set containing multiple `seq_ids` cannot be mixed with one having a `seq_id` in the multi-sequence set: tokens with `seq_ids = { 0, 1, 2, 3 }` are not compatible with tokens in `seq_ids = { 1 }`. The reason is that the recurrent states are only copied to the target sequences on ubatch boundaries, and so dependent tokens cannot be mixed with a shared trunk.
Is this handled here?
Basically, the main constraint to check would be that the sequence sets in a ubatch are independent (at least, I think that would be sufficient?).
(Before this PR, this was handled by splitting multi-sequence token groups on their own before the single-sequence tokens.)
(I did not implement multi-sequence tests yet in #14139, but that should also be able to answer this question once implemented.)
> For example, a sequence set containing multiple seq_ids cannot be mixed with one having a seq_id in the multi-sequence set.

Yes, this logic at the beginning of the function determines the unique non-overlapping sequence sets that will be contained in this ubatch (lines 421 to 446 at commit 034b055):
```cpp
    // determine the non-overlapping sequence sets participating in this ubatch
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        if (used[i]) {
            continue;
        }

        bool add = true;

        for (uint32_t s = 0; s < cur_seq_set.size(); ++s) {
            // no overlap with existing sequence sets:
            if (!(cur_seq_set[s] & seq_set[i]).none()) {
                add = false;
                break;
            }
        }

        if (add) {
            cur_seq_set.push_back(seq_set[i]);

            if (cur_seq_set.size() > n_ubatch) {
                break;
            }
        }
    }

    const uint32_t n_seqs = cur_seq_set.size();
```
@compilade FYI tentative plan is to first merge #13979 and after that to merge this PR (unless you spot some more issues). ETA probably tomorrow.
- Remove `llama_sbatch`; `llama_batch_allocr` now handles ubatch splitting
- `llama_batch_allocr` precomputes various index maps and guarantees the inputs are consistent
- `llama_ubatch` can now iterate over unique sequence ids
- Rename `llama_ubatch.n_seqs` from "number of sequences" to "number of sequence sets"
- Allow `n_tokens <= seq_id`; remove padding hack from `llama-server`

TODO: