HybridChunker with tiktoken tokenizer #1031
Hi, What would be the recommended way to use Thank you for the amazing library!
Replies: 2 comments
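Hi @ruizguille 👋
At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase).
That said, the actual text splitting library used in parts of the workflow, semchunk, already supports tiktoken. So one could expand the HybridChunker such that it can operate with both, by allowing self._tokenizer to be a Union of the two tokenizer types.
👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF):
Would you be interested in submitting a PR yourself? 🙌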
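To make the Union idea above a bit more concrete, here is a minimal sketch rather than the actual HybridChunker code. It assumes the chunker mainly needs two things from its tokenizer: a token count for a piece of text and an upper token limit. The helper names count_tokens and resolve_max_tokens, and the two stub classes, are hypothetical and only used for illustration. The sketch relies on HF tokenizers exposing tokenize() and model_max_length, and on tiktoken encodings exposing encode() but no built-in length limit.

```python
from typing import Any, List, Optional


def count_tokens(tokenizer: Any, text: str) -> int:
    """Count tokens with either an HF-style or a tiktoken-style tokenizer."""
    # Check for tokenize() first: HF tokenizers also have encode(),
    # while tiktoken encodings have no tokenize().
    if hasattr(tokenizer, "tokenize"):
        return len(tokenizer.tokenize(text))
    if hasattr(tokenizer, "encode"):
        return len(tokenizer.encode(text))
    raise TypeError(f"unsupported tokenizer type: {type(tokenizer).__name__}")


def resolve_max_tokens(tokenizer: Any, max_tokens: Optional[int] = None) -> int:
    """Pick the token limit: an explicit value wins, else the HF attribute."""
    if max_tokens is not None:
        return max_tokens
    limit = getattr(tokenizer, "model_max_length", None)
    if limit is None:
        # tiktoken encodings carry no limit, so the caller has to supply one.
        raise ValueError("max_tokens is required for this tokenizer")
    return int(limit)


# Hypothetical stand-ins so the sketch runs without transformers or tiktoken.
class HFStyleStub:
    model_max_length = 512

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class TiktokenStyleStub:
    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


if __name__ == "__main__":
    print(count_tokens(HFStyleStub(), "chunk this text"))
    print(count_tokens(TiktokenStyleStub(), "chunk"))
    print(resolve_max_tokens(TiktokenStyleStub(), max_tokens=8192))
```

With real objects, the stubs would be replaced by something like an HF tokenizer instance or a tiktoken encoding. The explicit max_tokens argument is a design choice of this sketch, since a tiktoken encoding alone does not say how long a chunk may be.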
-
Hi @vagenas, Thanks a lot for the details!
Hi @ruizguille 👋
At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase).
That said, the actual text splitting library used in parts of the workflow, semchunk, already supports tiktoken.
So one could expand the HybridChunker such that it can operate with both, by allowing self._tokenizer to be a Union of the two tokenizer types.
👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF):