A faster, lighter tokenizer library for .NET (.NET 10)
| Feature | Status | Description |
|---|---|---|
| External vocab | ✅ | Load custom tokenizer.json at runtime—no recompilation, no hard-coded maps. |
| Special tokens | ✅ | Add special tokens from tokenizer.json; keep or hide tokens such as `<\|im_start\|>` during encoding and decoding. |
| Chinese-safe, optimized | ✅ | Handles Chinese text correctly, with tokenization tuned for performance. |
| Unit tests | ✅ | Core, robustness, Chinese, English, and mixed-text unit test cases. |
| Highly efficient | ✅ | Built on a HighPerformanceSpanSplitter and a span-based dictionary purpose-built for speed (see the sketch after this table). |
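The "Highly efficient" row refers to a span-based dictionary. As a rough illustration of that general technique (a minimal sketch using the .NET 9+/10 alternate-lookup API, not LumTokenizer's actual internals), a string-keyed vocabulary can be probed with `ReadOnlySpan<char>` slices of the input, so no substring is allocated per lookup:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not LumTokenizer's internal code.
var vocab = new Dictionary<string, int>
{
    ["hello"] = 1,
    ["!"] = 2,
};

// The alternate lookup accepts ReadOnlySpan<char> keys against the string-keyed dictionary.
var lookup = vocab.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> text = "hello!";

// Probe a slice of the input directly; no intermediate string is created.
if (lookup.TryGetValue(text[..5], out int id))
    Console.WriteLine(id); // prints 1
```

Avoiding a `Substring` allocation per candidate token is the main benefit a span-based dictionary brings to an encode loop.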
```bash
dotnet add package LumTokenizer
```

```csharp
var tokenizer = BPETokenizer.CreateTokenizer("minimind_tokenizer.txt"); // Not thread safe.
var ctokenizer = ConcurrentBPETokenizer.CreateTokenizer("minimind_tokenizer.txt"); // Thread safe.
Console.WriteLine(tokenizer.VocabSize);
var ids = tokenizer.Encode("hello! 萤火初芒,你好");
Console.WriteLine(string.Join(",", ids));
Console.WriteLine(tokenizer.Decode(ids));

// Output
// 6400
// 5125,338,3,223,223,1109,100,2399,3187,784,243,270,5134
// hello! 萤火初芒,你好

// More with minimind_tokenizer (vocab size 6400)
// hello friend => 5125,338,2487
// 中国北京 => 2366,6210
// <|im_start|>i'm 好<|im_end|> => 1,75,2115,223,587,2

// More with qwen (vocab size 151643)
// 中国北京 => 58695,68990
// hello friend => 14990,4238
// <|im_start|>i'm 好<|im_end|> => 151644,72,2776,4891,98,121,151645
```

Here is a comparison of the token IDs generated by LumTokenizer, SharpToken, and TiktokenSharp for the same input text using different tokenizers.
LumTokenizer_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
SharpToken_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
TiktokenSharp_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
LumTokenizer_qwen150k
33555,59978,11,825,315,41382,594,75969,323,1429,72035,11088,11,10742,279,3364,315,279,45237,323,12011,12681,59978,11,879,64828,806,25079,11,438,566,1558,806,3527,37605,11,4092,311,51571,323,73471,13,59978,748,7901,438,264,6981,2922,360,3848,5561,323,806,1879,304,62255,323,30826,13,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
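The benchmark table below also includes a `LumTokenizer_concurrent` variant. Here is a minimal usage sketch for the thread-safe `ConcurrentBPETokenizer` from the quick start, assuming it exposes the same `Encode` method as `BPETokenizer` (the tokenizer file name is taken from the example above):

```csharp
using System;
using System.Threading.Tasks;

// Thread-safe variant from the quick start: a single instance can be shared across threads.
// Assumes Encode has the same shape as on BPETokenizer.
var tokenizer = ConcurrentBPETokenizer.CreateTokenizer("minimind_tokenizer.txt");

string[] inputs = { "hello friend", "中国北京", "hello! 萤火初芒,你好" };

// Encode all inputs in parallel on the shared instance.
Parallel.ForEach(inputs, text =>
{
    var ids = tokenizer.Encode(text);
    Console.WriteLine($"{text} => {string.Join(",", ids)}");
});
```

As the table shows, the concurrent variant trades some speed and extra allocations for thread safety.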
| Method | text | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SharpToken_cl100k_base | 人类:请(...)失问题。 [424] | 141.46 us | 2.779 us | 5.354 us | 6.24 | 0.30 | 0.7324 | 9.1 KB | 1.19 |
| TiktokenSharp_cl100k_base | 人类:请(...)失问题。 [424] | 101.19 us | 1.908 us | 1.874 us | 4.47 | 0.16 | 0.4883 | 6.34 KB | 0.83 |
| LumTokenizer_cl100k_base | 人类:请(...)失问题。 [424] | 22.68 us | 0.458 us | 0.671 us | 1.00 | 0.04 | 0.6104 | 7.63 KB | 1.00 |
| LumTokenizer_concurrent_cl100k_base | 人类:请(...)失问题。 [424] | 30.28 us | 0.580 us | 0.690 us | 1.34 | 0.05 | 3.5706 | 43.88 KB | 5.75 |
| SharpToken_cl100k_base | Huma(...)elp. [1062] | 27.45 us | 0.544 us | 0.745 us | 0.93 | 0.03 | 0.6714 | 8.38 KB | 0.74 |
| TiktokenSharp_cl100k_base | Huma(...)elp. [1062] | 20.66 us | 0.401 us | 0.613 us | 0.70 | 0.02 | 0.4272 | 5.51 KB | 0.49 |
| LumTokenizer_cl100k_base | Huma(...)elp. [1062] | 29.43 us | 0.582 us | 0.571 us | 1.00 | 0.03 | 0.9155 | 11.31 KB | 1.00 |
| LumTokenizer_concurrent_cl100k_base | Huma(...)elp. [1062] | 55.34 us | 0.268 us | 0.238 us | 1.88 | 0.04 | 7.3242 | 90.02 KB | 7.96 |
| SharpToken_cl100k_base | User(...)的样本。 [628] | 79.35 us | 1.561 us | 2.030 us | 3.01 | 0.08 | 0.8545 | 10.9 KB | 1.23 |
| TiktokenSharp_cl100k_base | User(...)的样本。 [628] | 62.18 us | 1.238 us | 2.415 us | 2.36 | 0.09 | 0.4883 | 6.74 KB | 0.76 |
| LumTokenizer_cl100k_base | User(...)的样本。 [628] | 26.34 us | 0.165 us | 0.154 us | 1.00 | 0.01 | 0.7019 | 8.83 KB | 1.00 |
| LumTokenizer_concurrent_cl100k_base | User(...)的样本。 [628] | 41.41 us | 0.163 us | 0.136 us | 1.57 | 0.01 | 3.9673 | 49.05 KB | 5.56 |
Special tokens were ignored in all runs (the default setting for each library). SharpToken 2.0.4 and TiktokenSharp 1.2.0 were used. The benchmark code follows:
```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using SharpToken;     // GptEncoding
using TiktokenSharp;  // TikToken
// Add the LumTokenizer namespace for BPETokenizer and RegexType as well.

internal class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<CompareBenchmark>();
}
}
[MemoryDiagnoser]
public class CompareBenchmark
{
internal GptEncoding _sharpToken;
internal TikToken _tikToken;
internal BPETokenizer _tokenizer1;
internal BPETokenizer _tokenizer2;
[GlobalSetup]
public void Setup()
{
_sharpToken = GptEncoding.GetEncoding("cl100k_base");
_tikToken = TikToken.GetEncodingAsync("cl100k_base").ConfigureAwait(false).GetAwaiter().GetResult();
_tokenizer1 = BPETokenizer.CreateTokenizer(
@"D:\Data\Personal\AI\llm\tokenizer\cl100k.txt", true, RegexType.RegexCl100KBase);
_tokenizer2 = BPETokenizer.CreateTokenizer(
@"D:\Data\Personal\AI\llm\tokenizer\qw_tokenizer.json", false, RegexType.RegexCl100KBase);
}
// ====== 1. Declare the parameter source ======
public IEnumerable<string> TextSamples()
{
yield return TextCatalog.English;
yield return TextCatalog.Chinese;
yield return TextCatalog.Mixed;
}
// ====== 2. Each benchmark method takes a text parameter ======
[Benchmark]
[ArgumentsSource(nameof(TextSamples))]
public int SharpToken_cl100k_base(string text)
{
var encoded = _sharpToken.Encode(text);
var decoded = _sharpToken.Decode(encoded);
return encoded.Count;
}
[Benchmark]
[ArgumentsSource(nameof(TextSamples))]
public int TiktokenSharp_cl100k_base(string text)
{
var encoded = _tikToken.Encode(text);
var decoded = _tikToken.Decode(encoded);
return encoded.Count;
}
[Benchmark(Baseline = true)]
[ArgumentsSource(nameof(TextSamples))]
public int LumTokenizer_cl100k_base(string text)
{
var encoded = _tokenizer1.Encode(text, false);
var decoded = _tokenizer1.Decode(encoded, false);
return encoded.Count;
}
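// Not marked [Benchmark], so this method does not appear in the results table above.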
public int LumTokenizer_qwen150k(string text)
{
var encoded = _tokenizer2.Encode(text, false);
var decoded = _tokenizer2.Decode(encoded, false);
return encoded.Count;
}
}
public static class TextCatalog
{
/* 1. Long English dialogue */
public static readonly string English =
"Human: Can you explain how gradient descent works in deep learning?\n\n" +
"Assistant: Sure! Gradient descent is an optimization algorithm used to minimize the loss function. " +
"The basic idea is to compute the gradient of the loss with respect to each parameter, then update " +
"the parameters in the opposite direction of the gradient. The learning rate controls the step size. " +
"There are variants like SGD, momentum, Adam, each improving convergence speed or stability. " +
"In practice, we use mini-batch gradient descent to balance computational efficiency and convergence. " +
"The loss landscape can be very high-dimensional and non-convex, so careful tuning of hyper-parameters " +
"such as learning rate schedules, weight decay, and initialization strategies is essential. " +
"Without these tricks, training can stall or diverge.\n\n" +
"Human: What are the common tricks to avoid overfitting?\n\n" +
"Assistant: Common regularization techniques include dropout, weight decay (L2), early stopping, " +
"data augmentation, and batch normalization. Increasing dataset size and using simpler models also help.";
/* 2. Long Chinese-only dialogue */
public static readonly string Chinese =
"人类:请详细介绍一下 Transformer 的核心思想。\n\n" +
"助手:Transformer 完全摒弃了递归结构,仅依靠自注意力机制来捕捉序列中的长距离依赖。 " +
"输入序列首先被映射为查询、键和值三个向量,接着通过缩放点积注意力计算每一位置对其他位置的权重, " +
"从而在一次前向传播中同时聚合全局信息。多头机制允许模型在不同子空间内并行学习多种关系。 " +
"此外,位置编码被直接加到词向量上,为模型提供顺序信息。整体结构由编码器和解码器堆叠而成, " +
"每一层都包含多头自注意力、前馈网络、残差连接和层归一化。该设计大幅提升了训练并行度, " +
"成为后续 BERT、GPT 系列以及 T5 等模型的基础,推动了预训练加微调的新范式。\n\n" +
"人类:它与传统 RNN 相比有什么优势?\n\n" +
"助手:最主要的优势是并行化。RNN 必须依次计算隐藏状态,而 Transformer 可一次性处理整个序列, " +
"训练速度显著提高。同时,自注意力直接建模任意两位置间的依赖,缓解了长距离梯度消失问题。";
/* 3. Long mixed Chinese-English dialogue (no special tokens) */
public static readonly string Mixed =
"User:最近大模型很火,能不能用简单 English 解释一下 RLHF 是怎么做的?\n\n" +
"Assistant:RLHF 全称 Reinforcement Learning from Human Feedback,核心流程分三步。 " +
"第一步,用 supervised fine-tuning 在高质量人工标注数据上微调 base 模型,得到 SFT 模型。 " +
"第二步,收集同一 prompt 下多个 response 的对比数据,训练一个 reward model 来打分。 " +
"第三步,用 reinforcement learning(通常是 PPO)继续优化 SFT 模型,把 reward model 的分数作为 reward signal, " +
"同时加入 KL penalty 防止模型偏离原始分布太远。迭代几轮后,模型就能输出更对齐人类偏好的答案。 " +
"整个 pipeline 需要大量人工标注和计算资源,但效果上能显著降低 harmful 或 untruthful 输出的概率。\n\n" +
"User:训练 reward model 时有哪些 tricks?\n\n" +
"Assistant:常见技巧包括 pair-wise 排序损失、对同一 batch 内样本做 normalization、 " +
"以及使用 larger batch size 和 lower learning rate 来稳定训练。数据质量比数量更重要, " +
"需要严格过滤 inconsistent 或恶意标注的样本。";
}
```

We welcome contributions to LumTokenizer. If you find any issues or have feature requests, please submit them through our issue tracker.
LumTokenizer is licensed under the MIT License. See the LICENSE file for more information.
Please enjoy using LumTokenizer and help us make it even better by providing feedback and contributions!