support tiktoken #7

chengchingwen · 2023-10-06T12:12:44Z

Support the tokenizers provided by tiktoken. The tokenizer files are re-hosted on my gist because Artifacts.jl does not support lazy artifacts without unpacking.

cc @ztangent

codecov · 2023-10-06T12:15:37Z

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (ad9d33b) 80.20% compared to head (e2efa4e) 83.63%.

❗ Current head e2efa4e differs from the pull request's most recent head 3c2b05e. Consider uploading reports for commit 3c2b05e to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##           master       #7      +/-   ##
==========================================
+ Coverage   80.20%   83.63%   +3.43%     
==========================================
  Files           8        9       +1     
  Lines         389      495     +106     
==========================================
+ Hits          312      414     +102     
- Misses         77       81       +4

| Files | Coverage Δ | |
|---|---|---|
| src/BytePairEncoding.jl | `100.00% <ø> (ø)` | |
| src/bpe.jl | `91.97% <100.00%> (+0.92%)` | ⬆️ |
| src/bytefallback.jl | `0.00% <ø> (ø)` | |
| src/gpt2_utils.jl | `100.00% <100.00%> (ø)` | |
| src/tokenization.jl | `77.77% <100.00%> (+6.34%)` | ⬆️ |
| src/tiktoken.jl | `94.62% <94.62%> (ø)` | |


chengchingwen · 2023-10-06T13:31:14Z

julia> using BytePairEncoding, TextEncodeBase

julia> tkr = BytePairEncoding.load_tiktoken("cl100k_base")
FlatTokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns))

julia> tkr(TextEncodeBase.Sentence("hello world aaaaaaaaaaaa"))
5-element Vector{TextEncodeBase.TokenStage}:
 Token("hello", (ismatch = false,))
 Token(" world", (ismatch = false,))
 Token(" a", (ismatch = false,))
 Token("aaaaaaaa", (ismatch = false,))
 Token("aaa", (ismatch = false,))
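
For context on the tokenizer files mentioned in the description: tiktoken's published rank files are plain text, with one base64-encoded token and its integer rank per line. The following is a minimal sketch of reading that format. The helper name `parse_tiktoken_ranks` and the two-line in-memory sample are assumptions made for illustration; this is not the loader added in this PR, and `load_tiktoken` may work differently.

```julia
using Base64

# Hypothetical helper (not part of BytePairEncoding.jl): read tiktoken-style
# rank data, where each non-empty line is "<base64 token> <rank>".
function parse_tiktoken_ranks(io::IO)
    ranks = Dict{Vector{UInt8}, Int}()
    for line in eachline(io)
        isempty(line) && continue
        token_b64, rank = split(line)
        # Tokens are raw bytes and need not be valid UTF-8, so keep them as bytes.
        ranks[base64decode(String(token_b64))] = parse(Int, rank)
    end
    return ranks
end

# Made-up sample: "hello" with rank 0 and " world" with rank 1.
sample = IOBuffer("aGVsbG8= 0\nIHdvcmxk 1\n")
ranks = parse_tiktoken_ranks(sample)

# Look up the rank of a token by its bytes.
hello_rank = ranks[Vector{UInt8}("hello")]
```

Keeping the keys as `Vector{UInt8}` rather than `String` is a deliberate choice in this sketch, since byte-level BPE tokens can split multi-byte characters.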

add tiktoken bpe

01ade81

chengchingwen mentioned this pull request Oct 6, 2023

Support for cl100k_base encoding, used by new OpenAI models? #6

Closed

chengchingwen added 2 commits October 6, 2023 21:17

add loader

0bc7242

small fix

f11b590

chengchingwen marked this pull request as ready for review October 6, 2023 13:24

chengchingwen added 4 commits October 7, 2023 13:45

test with pythoncall

8cb7edc

fix test artifact path

fffd626

add converter between tiktoken and bbpe

e2efa4e

make conversion reversible

3c2b05e

chengchingwen merged commit 3c2b05e into master Oct 7, 2023
0 of 7 checks passed