Upgrade tiktoken-rs #29

Merged
merged 6 commits into from
Oct 14, 2024

@hendrikvanantwerpen hendrikvanantwerpen commented Oct 14, 2024

Version 0.6.0 of the tiktoken-rs crate was released. The new version makes private some methods that we used to get the raw bytes for tokens. This PR changes how we build OpenAI tokenizers: instead of going through tiktoken-rs, we now intern the tiktoken data files and read them directly.

Main changes:

  • Move all tests to a separate crate, so that tests can depend on bpe-openai. This is similar to what we did for benchmarks before. Without this, we'd have to duplicate some of the logic to instantiate at least one OpenAI tokenizer in the bpe crate to be able to run tests.
  • Add gzipped copies of the .tiktoken files to the bpe-openai crate and build serialized tokenizers from those.
  • Rename bpe-openai public methods to the official token set names (e.g. cl100k_base instead of just cl100k).
  • The bpe crate gained some functions to read tiktoken data, and no longer has a (conditional) dependency on tiktoken-rs.
  • Upgrade all tiktoken-rs dependencies to version 0.6.
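
For readers unfamiliar with the data format: a .tiktoken file is plain text with one token per line, written as the base64-encoded token bytes, a space, and the token's rank. The sketch below shows roughly what reading that data involves. It is an illustration only, not the code added in this PR: `decode_base64` and `parse_tiktoken` are hypothetical names, the base64 decoder is hand-rolled purely to keep the example dependency-free, and the gzip decompression step is left out.

```rust
/// Decode standard base64 (RFC 4648 alphabet, optional `=` padding).
/// Hypothetical helper, hand-rolled to avoid external crates in this sketch.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut buffer: u32 = 0; // holds bits not yet emitted as a full byte
    let mut bits: u32 = 0; // number of valid bits currently in `buffer`
    for c in input.bytes() {
        let value = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None, // reject anything outside the alphabet
        };
        buffer = (buffer << 6) | u32::from(value);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1u32 << bits) - 1; // keep only the leftover bits
        }
    }
    Some(out)
}

/// Parse tiktoken data into (token bytes, rank) pairs.
/// Hypothetical function; returns None on any malformed line.
fn parse_tiktoken(data: &str) -> Option<Vec<(Vec<u8>, usize)>> {
    let mut tokens = Vec::new();
    for line in data.lines().filter(|l| !l.trim().is_empty()) {
        let (token, rank) = line.trim().split_once(' ')?;
        tokens.push((decode_base64(token)?, rank.parse::<usize>().ok()?));
    }
    Some(tokens)
}
```

Calling `parse_tiktoken("IQ== 0\naGVsbG8= 1\n")` yields the byte strings `!` and `hello` with ranks 0 and 1. A real loader would first decompress the gzipped file and would likely want proper error types rather than `Option`.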

@hendrikvanantwerpen hendrikvanantwerpen marked this pull request as ready for review October 14, 2024 14:12
@aneubeck aneubeck left a comment

LGTM
Maybe add a comment somewhere explaining why the tests live in another folder.
It's obviously not perfect that we now have to compile two crates before we can run any tests... But I don't really see an easy way around that problem.

@hendrikvanantwerpen hendrikvanantwerpen merged commit 8ecf192 into main Oct 14, 2024
3 checks passed
@hendrikvanantwerpen hendrikvanantwerpen deleted the intern-tiktoken-data branch October 14, 2024 16:16