Token loss for llama.tokenize() with mixed Chinese/English text #86

@cwleungar

Description

When calling llama.tokenize() from llama_cpp_dart on a mixed Chinese/English string, the returned token count is significantly smaller than the token count produced by llama-cpp-python using the same GGUF model and text.

For the test case below, Dart returns 48 tokens while Python returns 71 tokens. The Python output matches llama.cpp’s behavior, so it looks like the Dart binding is losing tokens somewhere.

Environment
Platform: Android

Architecture: arm64-v8a

llama_cpp_dart version: ^0.1.2+1

Model: bge-m3-q4_k_m.gguf

Dart code (llama_cpp_dart)

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

final contextParams = ContextParams()..nCtx = 2048;

// modelPath points at bge-m3-q4_k_m.gguf on device storage
final llama = Llama(
  modelPath,
  ModelParams(),
  contextParams,
  SamplerParams(),
  true,
);

const text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week.""";

final tokens = llama.tokenize(
  text,
  true, // addBos
);

print("token count (Dart): ${tokens.length}");
print(tokens);

Observed result (Dart)

tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 3, 2]
token count (Dart): 48

The tail of the English sentence is never tokenized: the first 46 tokens match the Python output below exactly, after which the Dart result jumps straight to the trailing [3, 2] instead of continuing through the English text.
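
A guess at the mechanism (I have not read the binding source, so treat this as a hypothesis): Dart's String.length counts UTF-16 code units, while the native llama.cpp tokenizer takes the length of the UTF-8 byte buffer. For mixed CJK/ASCII text the two differ, and passing the code-unit count as the byte count would truncate exactly the kind of ASCII tail that is missing here. A standalone check, no llama_cpp_dart involved:

import 'dart:convert';

void main() {
  // Same shape as the test string: a CJK prefix plus an ASCII tail.
  const sample = "駕駛執照 licence";

  print(sample.length);              // 12: UTF-16 code units (1 per CJK char here)
  print(utf8.encode(sample).length); // 20: UTF-8 bytes (3 per CJK char)
}

If the native call were handed 12 as the byte count, the tokenizer would never see the last 8 bytes, i.e. the ASCII tail, which is the same symptom as above.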

Python code (llama-cpp-python, same GGUF)

from llama_cpp import Llama

model = Llama(
    model_path="path/to/same/model.gguf",
    embedding=True,
)

text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week."""

token_ids = model.tokenize(text.encode("utf-8"))  # add_bos defaults to True, matching addBos in the Dart call

print("token count (Python):", len(token_ids))
print(token_ids)

Observed result (Python)

tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423, 30, 10660, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2]
token count (Python): 71

Detokenizing the Python token list with model.detokenize(token_ids) reproduces the full original string.

Expected behavior
llama.tokenize() in Dart should return the same token sequence as llama-cpp-python (and llama.cpp) for the same GGUF model and UTF-8 input text: 71 tokens for this test case, with nothing dropped from the tail of the string.
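
Once fixed, a round-trip check along these lines could serve as a regression test (a sketch that reuses llama and text from the Dart snippet above; the expected count of 71 is the one llama-cpp-python produces):

final tokens = llama.tokenize(text, true); // addBos
assert(
  tokens.length == 71, // count produced by llama-cpp-python / llama.cpp
  "token loss: expected 71 tokens, got ${tokens.length}",
);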

Additional notes
Changing nCtx does not affect the Dart token count.

The discrepancy appears only in Dart; Python behaves as expected with the same model and text.

This is user-facing in embedding/RAG scenarios, where silently dropped input text degrades the resulting embeddings.
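
If the byte-length hypothesis above is right, a caller-side canary can at least flag the strings at risk until the binding is fixed, namely those whose UTF-8 byte length differs from their UTF-16 length (atRiskOfTruncation is a hypothetical helper, not part of llama_cpp_dart):

import 'dart:convert';

/// Hypothetical guard: true when `text` contains non-ASCII characters,
/// i.e. when a byte-count/code-unit-count mixup could truncate it.
bool atRiskOfTruncation(String text) =>
    utf8.encode(text).length != text.length;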
