
HybridChunker with tiktoken tokenizer #1031

Answered by vagenas
ruizguille asked this question in Q&A

Hi @ruizguille 👋

At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase).

That said, the actual text splitting library used in parts of the workflow, semchunk, already supports tiktoken.

So one could expand the HybridChunker so that it can operate with both, by allowing self._tokenizer to be of that Union.

👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF):

  • the max tokens allowed for the model
  • the number of tokens a given piece of text would correspond to
  • additionally, since string input is still supported, one would have to decide which of HF / tiktoken a plain string would get mapped…

Answer selected by ruizguille