| 1 | +--- |
| 2 | +title: Use Microsoft.ML.Tokenizers for text tokenization |
| 3 | +description: Learn how to use the Microsoft.ML.Tokenizers library to tokenize text for AI models, manage token counts, and work with various tokenization algorithms. |
| 4 | +ms.topic: how-to |
| 5 | +ms.date: 10/29/2025 |
| 6 | +ai-usage: ai-assisted |
| 7 | +--- |
| 8 | +# Use Microsoft.ML.Tokenizers for text tokenization |
| 9 | + |
| 10 | +The [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) library provides a comprehensive set of tools for tokenizing text in .NET applications. Tokenization is essential when you work with large language models (LLMs), as it allows you to manage token counts, estimate costs, and preprocess text for AI models. |
| 11 | + |
| 12 | +This article shows you how to use the library's key features and work with different tokenizer models. |
| 13 | + |
| 14 | +## Prerequisites |
| 15 | + |
| 16 | +- [.NET 8 SDK](https://dotnet.microsoft.com/download/dotnet/8.0) or later |
| 17 | + |
| 18 | +> [!NOTE] |
| 19 | +> The Microsoft.ML.Tokenizers library also supports .NET Standard 2.0, making it compatible with .NET Framework 4.6.1 and later. |
| 20 | + |
| 21 | +## Install the package |
| 22 | + |
| 23 | +Install the Microsoft.ML.Tokenizers NuGet package: |
| 24 | + |
| 25 | +```dotnetcli |
| 26 | +dotnet add package Microsoft.ML.Tokenizers |
| 27 | +``` |
| 28 | + |
| 29 | +For Tiktoken models, you also need to install the data package that matches the model's encoding. For example, GPT-4o uses the `o200k_base` encoding, which the following package provides: |
| 30 | + |
| 31 | +```dotnetcli |
| 32 | +dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase |
| 33 | +``` |
| 34 | + |
| 35 | +## Key features |
| 36 | + |
| 37 | +The Microsoft.ML.Tokenizers library provides: |
| 38 | + |
| 39 | +- **Extensible tokenizer architecture**: Allows specialization of Normalizer, PreTokenizer, Model/Encoder, and Decoder components. |
| 40 | +- **Multiple tokenization algorithms**: Supports BPE (byte-pair encoding), Tiktoken, Llama, CodeGen, and more. |
| 41 | +- **Token counting and estimation**: Helps manage costs and context limits when working with AI services. |
| 42 | +- **Flexible encoding options**: Provides methods to encode text to token IDs, count tokens, and decode tokens back to text. |
| 43 | + |
| 44 | +## Use Tiktoken tokenizer |
| 45 | + |
| 46 | +The Tiktoken tokenizer is commonly used with OpenAI models like GPT-4. The following example shows how to initialize a Tiktoken tokenizer and perform common operations: |
| 47 | + |
| 48 | +:::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/TiktokenExample.cs" id="TiktokenBasic"::: |
| 49 | + |
| 50 | +Creating a tokenizer loads and parses its vocabulary data, so for better performance, create the instance once, cache it, and reuse it throughout your app. |
| 51 | + |
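
One way to follow the caching advice above is to hold a single tokenizer behind a `Lazy<T>`. The following is a minimal sketch, not the article's snippet: `SharedTokenizer` is a hypothetical helper name, and the code assumes the Microsoft.ML.Tokenizers and Microsoft.ML.Tokenizers.Data.O200kBase packages are installed.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Both calls return the same cached instance; the tokenizer is built only once.
Tokenizer first = SharedTokenizer.Instance;
Tokenizer second = SharedTokenizer.Instance;

Console.WriteLine($"Same instance: {ReferenceEquals(first, second)}");
Console.WriteLine($"Tokens: {first.CountTokens("Reuse the tokenizer instead of recreating it.")}");

// Hypothetical helper (not part of the library) that owns the shared tokenizer.
public static class SharedTokenizer
{
    // Lazy<T> defaults to thread-safe initialization, so the factory runs at most once.
    private static readonly Lazy<Tokenizer> s_instance =
        new(() => TiktokenTokenizer.CreateForModel("gpt-4o"));

    public static Tokenizer Instance => s_instance.Value;
}
```

In an app that uses dependency injection, registering the tokenizer as a singleton achieves the same effect. If you share one instance across threads, confirm that the tokenizer type you use supports concurrent calls.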
| 52 | +When you work with LLMs, you often need to manage text within token limits. The following example shows how to trim text to a specific token count: |
| 53 | + |
| 54 | +:::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/TiktokenExample.cs" id="TiktokenTrim"::: |
| 55 | + |
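
Trimming is often combined with a token budget and a rough cost estimate. The following sketch illustrates that idea under stated assumptions: the `maxPromptTokens` limit and the `pricePerMillionTokens` rate are made-up values for illustration, not real model limits or prices, and the code assumes the Microsoft.ML.Tokenizers.Data.O200kBase package is installed.

```csharp
using System;
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// Hypothetical values for illustration only; look up your model's real limit and pricing.
const int maxPromptTokens = 10;
const decimal pricePerMillionTokens = 2.50m;

string prompt = "Text tokenization is the process of splitting a string into a list of tokens.";

int promptTokens = tokenizer.CountTokens(prompt);
if (promptTokens > maxPromptTokens)
{
    // Find the character index where the first maxPromptTokens tokens end.
    int cutIndex = tokenizer.GetIndexByTokenCount(
        prompt, maxPromptTokens, out string? processedText, out int keptTokens);

    // The index refers to the normalized text when the tokenizer normalizes its input.
    prompt = (processedText ?? prompt).Substring(0, cutIndex);
    promptTokens = keptTokens;
}

decimal estimatedCost = promptTokens * pricePerMillionTokens / 1_000_000m;

Console.WriteLine($"Prompt after trimming: {prompt}");
Console.WriteLine($"Prompt tokens: {promptTokens}");
Console.WriteLine($"Estimated input cost: {estimatedCost}");
```

The estimate covers input tokens only. Services typically bill output tokens separately, so treat the result as a lower bound.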
| 56 | +## Use Llama tokenizer |
| 57 | + |
| 58 | +The Llama tokenizer is designed for the Llama family of models. It requires a tokenizer model file, which you can download from model repositories like Hugging Face: |
| 59 | + |
| 60 | +:::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/LlamaExample.cs" id="LlamaBasic"::: |
| 61 | + |
| 62 | +All tokenizers support advanced encoding options, such as controlling normalization and pretokenization: |
| 63 | + |
| 64 | +:::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/LlamaExample.cs" id="LlamaAdvanced"::: |
| 65 | + |
| 66 | +## Use BPE tokenizer |
| 67 | + |
| 68 | +*Byte-pair encoding* (BPE) is the underlying algorithm used by many tokenizers, including Tiktoken. BPE was originally developed as a text compression algorithm and was later adopted by OpenAI for tokenization when it pretrained the GPT model. The following example demonstrates BPE tokenization: |
| 69 | + |
| 70 | +:::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/BpeExample.cs" id="BpeBasic"::: |
| 71 | + |
| 72 | +The library also provides specialized tokenizers like <xref:Microsoft.ML.Tokenizers.BpeTokenizer> and <xref:Microsoft.ML.Tokenizers.EnglishRobertaTokenizer> that you can configure with custom vocabularies for specific models. |
| 73 | + |
| 74 | +For more information about BPE, see [Byte-pair encoding tokenization](https://huggingface.co/learn/llm-course/chapter6/5). |
| 75 | + |
| 76 | +## Common tokenizer operations |
| 77 | + |
| 78 | +All tokenizers in the library derive from the <xref:Microsoft.ML.Tokenizers.Tokenizer> abstract base class. The following table shows the most commonly used methods. |
| 79 | + |
| 80 | +| Method | Description | |
| 81 | +|-------------------------------------------------------|--------------------------------------| |
| 82 | +| <xref:Microsoft.ML.Tokenizers.Tokenizer.EncodeToIds*> | Converts text to a list of token IDs. | |
| 83 | +| <xref:Microsoft.ML.Tokenizers.Tokenizer.Decode*> | Converts token IDs back to text. | |
| 84 | +| <xref:Microsoft.ML.Tokenizers.Tokenizer.CountTokens*> | Returns the number of tokens in a text string. | |
| 85 | +| <xref:Microsoft.ML.Tokenizers.Tokenizer.EncodeToTokens*> | Returns detailed token information including values and IDs. | |
| 86 | +| <xref:Microsoft.ML.Tokenizers.Tokenizer.GetIndexByTokenCount*> | Finds the character index for a specific token count from the start. | |
| 87 | +| <xref:Microsoft.ML.Tokenizers.Tokenizer.GetIndexByTokenCountFromEnd*> | Finds the character index for a specific token count from the end. | |
| 88 | + |
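
The following sketch exercises each method from the table against one tokenizer. It's a minimal illustration rather than the article's snippet code: it assumes the Microsoft.ML.Tokenizers.Data.O200kBase package is installed, and the sample text and the token count of 5 are arbitrary choices.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string text = "Text tokenization is the process of splitting a string into a list of tokens.";

// EncodeToIds: text to token IDs.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
Console.WriteLine($"IDs: {string.Join(", ", ids)}");

// CountTokens: token count without keeping the ID list.
int count = tokenizer.CountTokens(text);
Console.WriteLine($"Token count: {count}");

// Decode: token IDs back to text.
string? decoded = tokenizer.Decode(ids);
Console.WriteLine($"Decoded: {decoded}");

// EncodeToTokens: each token's ID and string value.
IReadOnlyList<EncodedToken> tokens = tokenizer.EncodeToTokens(text, out string? normalizedText);
foreach (EncodedToken token in tokens)
{
    Console.WriteLine($"{token.Id}: '{token.Value}'");
}

// GetIndexByTokenCount: index where the first 5 tokens end.
int prefixEnd = tokenizer.GetIndexByTokenCount(text, 5, out string? processedText, out int prefixTokens);
Console.WriteLine($"First {prefixTokens} tokens: {(processedText ?? text).Substring(0, prefixEnd)}");

// GetIndexByTokenCountFromEnd: index where the last 5 tokens start.
int suffixStart = tokenizer.GetIndexByTokenCountFromEnd(text, 5, out processedText, out int suffixTokens);
Console.WriteLine($"Last {suffixTokens} tokens: {(processedText ?? text).Substring(suffixStart)}");
```

The `out string?` parameters return the normalized text when the tokenizer normalizes its input. The returned indexes refer to that normalized text, which is why the sketch falls back to the original string only when the value is `null`.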
| 89 | +## Migrate from other libraries |
| 90 | + |
| 91 | +If you're currently using `DeepDev.TokenizerLib` or `SharpToken`, consider migrating to Microsoft.ML.Tokenizers. The library has been enhanced to cover scenarios from those libraries and provides better performance and support. For migration guidance, see the [migration guide](https://github.com/dotnet/machinelearning/blob/main/docs/code/microsoft-ml-tokenizers-migration-guide.md). |
| 92 | + |
| 93 | +## Related content |
| 94 | + |
| 95 | +- [Understanding tokens](../conceptual/understanding-tokens.md) |
| 96 | +- [Microsoft.ML.Tokenizers API reference](/dotnet/api/microsoft.ml.tokenizers) |
| 97 | +- [Microsoft.ML.Tokenizers NuGet package](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) |