A faster, lighter tokenizer library for .NET (.NET 10)
| Feature | Status | Description |
|---|---|---|
| External vocab | ✅ | Load custom tokenizer.json at runtime—no recompilation, no hard-coded maps. |
| Special tokens | ✅ | Add special tokens from tokenizer.json; keep or hide tokens such as `<\|im_start\|>` during encoding and decoding. |
| Chinese-safe, optimized | ✅ | Handles Chinese text correctly, with tokenization tuned for performance. |
| Unit tests | ✅ | Core, robustness, Chinese, English, and mixed-text unit test cases. |
| Highly efficient | ✅ | Built on a HighPerformanceSpanSplitter and a span-based dictionary purpose-built for speed (see the sketch after this table). |
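The "Highly efficient" row refers to a span-based dictionary. As a rough illustration of that general technique (a minimal sketch using the .NET 9+/10 alternate-lookup API, not LumTokenizer's actual internals), a string-keyed vocabulary can be probed with `ReadOnlySpan<char>` slices of the input, so no substring is allocated per lookup:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not LumTokenizer's internal code.
var vocab = new Dictionary<string, int>
{
    ["hello"] = 1,
    ["!"] = 2,
};

// The alternate lookup accepts ReadOnlySpan<char> keys against the string-keyed dictionary.
var lookup = vocab.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> text = "hello!";

// Probe a slice of the input directly; no intermediate string is created.
if (lookup.TryGetValue(text[..5], out int id))
    Console.WriteLine(id); // prints 1
```

Avoiding a `Substring` allocation per candidate token is the main benefit a span-based dictionary brings to an encode loop.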
```bash
dotnet add package LumTokenizer
```

```csharp
var tokenizer = BPETokenizer.CreateTokenizer("minimind_tokenizer.txt"); // Not thread safe.
var ctokenizer = ConcurrentBPETokenizer.CreateTokenizer("minimind_tokenizer.txt"); // Thread safe.
Console.WriteLine(tokenizer.VocabSize);
var ids = tokenizer.Encode("hello! 萤火初芒,你好");
Console.WriteLine(string.Join(",", ids));
Console.WriteLine(tokenizer.Decode(ids));

// Output
// 6400
// 5125,338,3,223,223,1109,100,2399,3187,784,243,270,5134
// hello! 萤火初芒,你好

// More with minimind_tokenizer (vocab size 6400)
// hello friend => 5125,338,2487
// 中国北京 => 2366,6210
// <|im_start|>i'm 好<|im_end|> => 1,75,2115,223,587,2

// More with qwen (vocab size 151643)
// 中国北京 => 58695,68990
// hello friend => 14990,4238
// <|im_start|>i'm 好<|im_end|> => 151644,72,2776,4891,98,121,151645
```

Here is a comparison of the token IDs generated by LumTokenizer, SharpToken, and TiktokenSharp for the same input text using different tokenizers.
LumTokenizer_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
SharpToken_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
TiktokenSharp_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
LumTokenizer_qwen150k
33555,59978,11,825,315,41382,594,75969,323,1429,72035,11088,11,10742,279,3364,315,279,45237,323,12011,12681,59978,11,879,64828,806,25079,11,438,566,1558,806,3527,37605,11,4092,311,51571,323,73471,13,59978,748,7901,438,264,6981,2922,360,3848,5561,323,806,1879,304,62255,323,30826,13,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|><|im_start|>hello 你好<|im_end|>
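The benchmark table below also includes a `LumTokenizer_concurrent` variant. Here is a minimal usage sketch for the thread-safe `ConcurrentBPETokenizer` from the quick start, assuming it exposes the same `Encode` method as `BPETokenizer` (the tokenizer file name is taken from the example above):

```csharp
using System;
using System.Threading.Tasks;

// Thread-safe variant from the quick start: a single instance can be shared across threads.
// Assumes Encode has the same shape as on BPETokenizer.
var tokenizer = ConcurrentBPETokenizer.CreateTokenizer("minimind_tokenizer.txt");

string[] inputs = { "hello friend", "中国北京", "hello! 萤火初芒,你好" };

// Encode all inputs in parallel on the shared instance.
Parallel.ForEach(inputs, text =>
{
    var ids = tokenizer.Encode(text);
    Console.WriteLine($"{text} => {string.Join(",", ids)}");
});
```

As the table shows, the concurrent variant trades some speed and extra allocations for thread safety.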
| Method | text | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SharpToken_cl100k_base | 人类:请(...)失问题。 [424] | 141.46 us | 2.779 us | 5.354 us | 6.24 | 0.30 | 0.7324 | 9.1 KB | 1.19 |
| TiktokenSharp_cl100k_base | 人类:请(...)失问题。 [424] | 101.19 us | 1.908 us | 1.874 us | 4.47 | 0.16 | 0.4883 | 6.34 KB | 0.83 |
| LumTokenizer_cl100k_base | 人类:请(...)失问题。 [424] | 22.68 us | 0.458 us | 0.671 us | 1.00 | 0.04 | 0.6104 | 7.63 KB | 1.00 |
| LumTokenizer_concurrent_cl100k_base | 人类:请(...)失问题。 [424] | 30.28 us | 0.580 us | 0.690 us | 1.34 | 0.05 | 3.5706 | 43.88 KB | 5.75 |
| SharpToken_cl100k_base | Huma(...)elp. [1062] | 27.45 us | 0.544 us | 0.745 us | 0.93 | 0.03 | 0.6714 | 8.38 KB | 0.74 |
| TiktokenSharp_cl100k_base | Huma(...)elp. [1062] | 20.66 us | 0.401 us | 0.613 us | 0.70 | 0.02 | 0.4272 | 5.51 KB | 0.49 |
| LumTokenizer_cl100k_base | Huma(...)elp. [1062] | 29.43 us | 0.582 us | 0.571 us | 1.00 | 0.03 | 0.9155 | 11.31 KB | 1.00 |
| LumTokenizer_concurrent_cl100k_base | Huma(...)elp. [1062] | 55.34 us | 0.268 us | 0.238 us | 1.88 | 0.04 | 7.3242 | 90.02 KB | 7.96 |
| SharpToken_cl100k_base | User(...)的样本。 [628] | 79.35 us | 1.561 us | 2.030 us | 3.01 | 0.08 | 0.8545 | 10.9 KB | 1.23 |
| TiktokenSharp_cl100k_base | User(...)的样本。 [628] | 62.18 us | 1.238 us | 2.415 us | 2.36 | 0.09 | 0.4883 | 6.74 KB | 0.76 |
| LumTokenizer_cl100k_base | User(...)的样本。 [628] | 26.34 us | 0.165 us | 0.154 us | 1.00 | 0.01 | 0.7019 | 8.83 KB | 1.00 |
| LumTokenizer_concurrent_cl100k_base | User(...)的样本。 [628] | 41.41 us | 0.163 us | 0.136 us | 1.57 | 0.01 | 3.9673 | 49.05 KB | 5.56 |
Special tokens were ignored in all runs (the default setting for each library). SharpToken 2.0.4 and TiktokenSharp 1.2.0 were used. The benchmark code follows:
```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using SharpToken;     // GptEncoding
using TiktokenSharp;  // TikToken
// Add the LumTokenizer namespace for BPETokenizer and RegexType as well.

internal class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<CompareBenchmark>();
}
}
[MemoryDiagnoser]
public class CompareBenchmark
{
internal GptEncoding _sharpToken;
internal TikToken _tikToken;
internal BPETokenizer _tokenizer1;
internal BPETokenizer _tokenizer2;
[GlobalSetup]
public void Setup()
{
_sharpToken = GptEncoding.GetEncoding("cl100k_base");
_tikToken = TikToken.GetEncodingAsync("cl100k_base").ConfigureAwait(false).GetAwaiter().GetResult();
_tokenizer1 = BPETokenizer.CreateTokenizer(
@"D:\Data\Personal\AI\llm\tokenizer\cl100k.txt", true, RegexType.RegexCl100KBase);
_tokenizer2 = BPETokenizer.CreateTokenizer(
@"D:\Data\Personal\AI\llm\tokenizer\qw_tokenizer.json", false, RegexType.RegexCl100KBase);
}
// ====== 1. Declare the parameter source ======
public IEnumerable<string> TextSamples()
{
yield return TextCatalog.English;
yield return TextCatalog.Chinese;
yield return TextCatalog.Mixed;
}
// ====== 2. Each benchmark method takes a text parameter ======
[Benchmark]
[ArgumentsSource(nameof(TextSamples))]
public int SharpToken_cl100k_base(string text)
{
var encoded = _sharpToken.Encode(text);
var decoded = _sharpToken.Decode(encoded);
return encoded.Count;
}
[Benchmark]
[ArgumentsSource(nameof(TextSamples))]
public int TiktokenSharp_cl100k_base(string text)
{
var encoded = _tikToken.Encode(text);
var decoded = _tikToken.Decode(encoded);
return encoded.Count;
}
[Benchmark(Baseline = true)]
[ArgumentsSource(nameof(TextSamples))]
public int LumTokenizer_cl100k_base(string text)
{
var encoded = _tokenizer1.Encode(text, false);
var decoded = _tokenizer1.Decode(encoded, false);
return encoded.Count;
}
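// Not marked [Benchmark], so this method does not appear in the results table above.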
public int LumTokenizer_qwen150k(string text)
{
var encoded = _tokenizer2.Encode(text, false);
var decoded = _tokenizer2.Decode(encoded, false);
return encoded.Count;
}
}
public static class TextCatalog
{
/* 1. Long English dialogue */
public static readonly string English =
"Human: Can you explain how gradient descent works in deep learning?\n\n" +
"Assistant: Sure! Gradient descent is an optimization algorithm used to minimize the loss function. " +
"The basic idea is to compute the gradient of the loss with respect to each parameter, then update " +
"the parameters in the opposite direction of the gradient. The learning rate controls the step size. " +
"There are variants like SGD, momentum, Adam, each improving convergence speed or stability. " +
"In practice, we use mini-batch gradient descent to balance computational efficiency and convergence. " +
"The loss landscape can be very high-dimensional and non-convex, so careful tuning of hyper-parameters " +
"such as learning rate schedules, weight decay, and initialization strategies is essential. " +
"Without these tricks, training can stall or diverge.\n\n" +
"Human: What are the common tricks to avoid overfitting?\n\n" +
"Assistant: Common regularization techniques include dropout, weight decay (L2), early stopping, " +
"data augmentation, and batch normalization. Increasing dataset size and using simpler models also help.";
/* 2. Long Chinese-only dialogue */
public static readonly string Chinese =
"人类:请详细介绍一下 Transformer 的核心思想。\n\n" +
"助手:Transformer 完全摒弃了递归结构,仅依靠自注意力机制来捕捉序列中的长距离依赖。 " +
"输入序列首先被映射为查询、键和值三个向量,接着通过缩放点积注意力计算每一位置对其他位置的权重, " +
"从而在一次前向传播中同时聚合全局信息。多头机制允许模型在不同子空间内并行学习多种关系。 " +
"此外,位置编码被直接加到词向量上,为模型提供顺序信息。整体结构由编码器和解码器堆叠而成, " +
"每一层都包含多头自注意力、前馈网络、残差连接和层归一化。该设计大幅提升了训练并行度, " +
"成为后续 BERT、GPT 系列以及 T5 等模型的基础,推动了预训练加微调的新范式。\n\n" +
"人类:它与传统 RNN 相比有什么优势?\n\n" +
"助手:最主要的优势是并行化。RNN 必须依次计算隐藏状态,而 Transformer 可一次性处理整个序列, " +
"训练速度显著提高。同时,自注意力直接建模任意两位置间的依赖,缓解了长距离梯度消失问题。";
/* 3. Long mixed Chinese-English dialogue (no special tokens) */
public static readonly string Mixed =
"User:最近大模型很火,能不能用简单 English 解释一下 RLHF 是怎么做的?\n\n" +
"Assistant:RLHF 全称 Reinforcement Learning from Human Feedback,核心流程分三步。 " +
"第一步,用 supervised fine-tuning 在高质量人工标注数据上微调 base 模型,得到 SFT 模型。 " +
"第二步,收集同一 prompt 下多个 response 的对比数据,训练一个 reward model 来打分。 " +
"第三步,用 reinforcement learning(通常是 PPO)继续优化 SFT 模型,把 reward model 的分数作为 reward signal, " +
"同时加入 KL penalty 防止模型偏离原始分布太远。迭代几轮后,模型就能输出更对齐人类偏好的答案。 " +
"整个 pipeline 需要大量人工标注和计算资源,但效果上能显著降低 harmful 或 untruthful 输出的概率。\n\n" +
"User:训练 reward model 时有哪些 tricks?\n\n" +
"Assistant:常见技巧包括 pair-wise 排序损失、对同一 batch 内样本做 normalization、 " +
"以及使用 larger batch size 和 lower learning rate 来稳定训练。数据质量比数量更重要, " +
"需要严格过滤 inconsistent 或恶意标注的样本。";
}
```

We welcome contributions to LumTokenizer. If you find any issues or have feature requests, please submit them through our issue tracker.
LumTokenizer is licensed under the MIT License. See the LICENSE file for more information.
Please enjoy using LumTokenizer and help us make it even better by providing feedback and contributions!