pytorch-labs/tokenizers

tokenizers

C++ implementations of various tokenizers (SentencePiece, tiktoken, etc.). Useful for other PyTorch repos, such as torchchat and ExecuTorch, that build LLM runners on the ExecuTorch stack or the AOT Inductor stack.

SentencePiece tokenizer

Depends on Google's SentencePiece library: https://github.com/google/sentencepiece.

Tiktoken tokenizer

Adapted from https://github.com/sewenew/tokenizer.

License

tokenizers is released under the BSD 3-Clause license. (Additional code in this distribution is covered by the MIT and Apache open source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.
