Description
When calling llama.tokenize() from llama_cpp_dart on a mixed Chinese/English string, the returned token count is significantly smaller than the token count produced by llama-cpp-python using the same GGUF model and text.
For the test case below, Dart returns 48 tokens while Python returns 71 tokens. The Python output matches llama.cpp’s behavior, so it looks like the Dart binding is losing tokens somewhere.
Environment
Platform: Android
Architecture: arm64-v8a
llama_cpp_dart version: ^0.1.2+1
Model: bge-m3-q4_k_m.gguf
Dart code (llama_cpp_dart)
final contextParams = ContextParams()..nCtx = 2048;
final llama = Llama(
  modelPath,
  ModelParams(),
  contextParams,
  SamplerParams(),
  true,
);

const text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week.""";

final tokens = llama.tokenize(
  text,
  true, // addBos
);
print("token count (Dart): ${tokens.length}");
print(tokens);
Observed result (Dart)
tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 3, 2]
token count (Dart): 48
The tail of the input is missing: the Dart token list matches the Python list (below) for its first 46 tokens and then ends with [3, 2], so everything from the middle of the second Chinese sentence onward, including the entire English sentence, is never tokenized.
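One possible explanation (speculation on my part; I have not traced the binding's FFI layer): Dart's String.length counts UTF-16 code units, while llama.cpp's llama_tokenize takes the length of the UTF-8 text buffer in bytes. If the binding passes the code-unit count as the byte length, the native side would only ever see a prefix of the UTF-8 buffer for non-ASCII input, which matches the symptom exactly. A minimal sketch of the mismatch, using a short hypothetical sample rather than the full test string:

import 'dart:convert';

void main() {
  // Each CJK character below is one UTF-16 code unit but three UTF-8 bytes,
  // so the two length measures diverge for mixed Chinese/English text.
  const sample = '駕駛執照 (P牌) licence';
  print('UTF-16 code units: ${sample.length}');        // 17
  print('UTF-8 bytes: ${utf8.encode(sample).length}'); // 27
}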
Python code (llama-cpp-python, same GGUF)
from llama_cpp import Llama

model = Llama(
    model_path="path/to/same/model.gguf",
    embedding=True,
)

text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week."""

token_ids = model.tokenize(text.encode("utf-8"))
print("token count (Python):", len(token_ids))
print(token_ids)
Observed result (Python)
tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423, 30, 10660, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2]
token count (Python): 71
When detokenized, Python returns the full original string.
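For what it's worth, the Dart output above is an exact prefix of the Python output for its first 46 tokens; only the trailing [3, 2] differ (if this model uses the usual XLM-R vocabulary of BGE-M3, 3 is <unk> and 2 is </s>, which would be consistent with the text being cut mid-character). A quick check over the two printed lists:

void main() {
  const dartTokens = [
    0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4,
    15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808,
    8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157,
    47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636,
    23636, 789, 3, 2
  ];
  const pythonTokens = [
    0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4,
    15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808,
    8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157,
    47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636,
    23636, 789, 6906, 79423, 30, 10660, 5154, 442, 831, 5646, 1257, 47,
    17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032,
    10, 5895, 5, 2
  ];
  // Walk both lists until they diverge.
  var i = 0;
  while (i < dartTokens.length &&
      i < pythonTokens.length &&
      dartTokens[i] == pythonTokens[i]) {
    i++;
  }
  print('shared prefix length: $i');                             // 46
  print('Dart tail after divergence: ${dartTokens.sublist(i)}'); // [3, 2]
}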
Expected behavior
llama.tokenize() in Dart should return the same token sequence as llama-cpp-python (and upstream llama.cpp) for the same GGUF model and UTF-8 input text: 71 tokens for this test case, with no loss of the tail of the string.
Additional notes
Changing nCtx does not affect the Dart token count.
The discrepancy appears only in Dart; Python behaves as expected with the same model and text.
This is user-facing in embedding / RAG scenarios: a silently truncated token sequence means embeddings are computed over only part of the input, and token-count-based chunking is wrong for any non-ASCII text.
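Until this is fixed, callers cannot easily work around it from outside the binding, but a small guard (a sketch, assuming the code-unit/byte-length hypothesis above; tokenizationMayTruncate is a hypothetical helper name) can at least flag the inputs at risk:

import 'dart:convert';

/// Flags inputs whose UTF-8 byte count differs from their UTF-16 code-unit
/// count, i.e. the inputs that could be truncated if the binding passes
/// String.length as the byte length (hypothesis above).
bool tokenizationMayTruncate(String text) =>
    utf8.encode(text).length != text.length;

void main() {
  print(tokenizationMayTruncate('hello'));   // false: pure ASCII
  print(tokenizationMayTruncate('駕駛執照')); // true: CJK input
}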