Version 0.6.0 of the `tiktoken-rs` crate was released. The new version makes private some methods that we used to get the raw bytes for tokens. This PR changes how we build OpenAI tokenizers: instead of using `tiktoken-rs`, we now intern the `tiktoken` data files and read them directly.

Main changes:
- Moved tests to `bpe-openai`. This is similar to what we did for benchmarks before. Without this, we'd have to duplicate some of the logic to instantiate at least one OpenAI tokenizer in the `bpe` crate to be able to run tests.
- Added `.tiktoken` files to the `bpe-openai` crate and build serialized tokenizers from those.
- Renamed `bpe-openai` public methods to the official token set names (e.g. `cl100k_base` instead of just `cl100k`).
- `bpe` got some functions to read tiktoken data, and no longer depends (conditionally) on `tiktoken-rs`.
- Updated `tiktoken-rs` dependencies to version 0.6.
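
For reviewers unfamiliar with the data files: each line of a `.tiktoken` file is a base64-encoded token followed by a space and its rank. The sketch below shows one way such a file could be read using only the standard library. The names `decode_base64` and `parse_tiktoken` are hypothetical and are not the functions added to `bpe` in this PR; the hand-rolled base64 decoder is only there to keep the example free of dependencies.

```rust
/// Decode standard base64 (with optional `=` padding) without external crates.
/// Hypothetical helper for illustration only.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() / 4 * 3);
    let mut buffer: u32 = 0; // holds bits that have not been emitted yet
    let mut bits: u32 = 0; // number of valid bits in `buffer`
    for byte in input.bytes() {
        let value = match byte {
            b'A'..=b'Z' => byte - b'A',
            b'a'..=b'z' => byte - b'a' + 26,
            b'0'..=b'9' => byte - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None, // not part of the base64 alphabet
        };
        buffer = (buffer << 6) | u32::from(value);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

/// Parse `.tiktoken` data into `(token bytes, rank)` pairs.
/// Hypothetical function; the real `bpe` API may look different.
fn parse_tiktoken(data: &str) -> Result<Vec<(Vec<u8>, u32)>, String> {
    let mut tokens = Vec::new();
    for (idx, line) in data.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let (encoded, rank) = line
            .split_once(' ')
            .ok_or_else(|| format!("line {}: expected `<base64> <rank>`", idx + 1))?;
        let bytes = decode_base64(encoded)
            .ok_or_else(|| format!("line {}: invalid base64", idx + 1))?;
        let rank: u32 = rank
            .parse()
            .map_err(|e| format!("line {}: invalid rank: {}", idx + 1, e))?;
        tokens.push((bytes, rank));
    }
    Ok(tokens)
}

fn main() {
    // Three sample lines in the `.tiktoken` layout.
    let data = "IQ== 0\nIg== 1\naGVsbG8= 2\n";
    for (bytes, rank) in parse_tiktoken(data).expect("valid sample data") {
        println!("{:>3} {:?}", rank, String::from_utf8_lossy(&bytes));
    }
}
```

Working with raw bytes rather than `String` matters here because many tokens are not valid UTF-8 on their own, which is why access to the raw token bytes is needed in the first place.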