
HybridChunker with tiktoken tokenizer #1031

Answered by vagenas
ruizguille asked this question in Q&A

Hi @ruizguille 👋

At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase).

That said, the actual text splitting library used in parts of the workflow, semchunk, already supports tiktoken.

So one could expand the HybridChunker so that it can operate with both, by allowing self._tokenizer to be of that Union.

👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF):

  • the max tokens allowed for the model
  • the number of tokens a given piece of text would correspond to
  • additionally, since string input is still supported, one would have to decide which of HF / tiktoken a plain string would get mapped…

Answer selected by ruizguille