New Bonus Materials: Byte Pair Encoding (BPE) Tokenizer From Scratch #489
rasbt announced in Announcements
-
What I have not yet fully understood is why the BPE training process preserves a leading space inside some tokens rather than treating spaces as separate tokens. Is this because separate space tokens would grow the context window utilization significantly?
-
Hah, he's got a good point! Not all zeros are the same kind of zero.
On Sat, Jan 18, 2025 at 6:09 PM Sebastian Raschka wrote:
Great question. I am also not sure about the history behind it, but I strongly suspect it's to keep the context window utilization small. Note that if you have multiple white spaces after each other, they get treated as separate white-space characters, so white-space characters do exist in GPT-2 tokenizers.
(Screenshot attached: Screenshot.2025-01-18.at.12.08.48.PM.png, https://github.com/user-attachments/assets/5d78d0df-d3d5-403f-b5ef-e3a90e6a9583)
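To see this concretely, here is a small sketch using the `tiktoken` package (assuming it is installed) that prints how the GPT-2 encoding splits strings with one vs. two spaces:

```python
import tiktoken

# Load the original GPT-2 BPE encoding
enc = tiktoken.get_encoding("gpt2")

for text in ["Hello world", "Hello  world"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# 'Hello world'  -> ['Hello', ' world']       (space merged into ' world')
# 'Hello  world' -> ['Hello', ' ', ' world']  (extra space is its own token)
```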
-
BTW, as you mentioned, a comparable BPE tokenizer can also be trained with the Hugging Face `tokenizers` library:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Initialize a BPE tokenizer with <|endoftext|> as the unknown token
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))

# Configure the trainer
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["<|endoftext|>"]
)

# Train the tokenizer on the sample text
tokenizer.train(files=["the-verdict.txt"], trainer=trainer)

# Save the tokenizer to disk
tokenizer.save("tokenizer.json")
```
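And as a quick follow-up, a hedged sketch (assuming the training snippet above ran and produced `tokenizer.json`) of how the saved tokenizer can be loaded back and applied:

```python
from tokenizers import Tokenizer

# Load the tokenizer trained above from disk
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode a sample string and inspect the results
encoding = tokenizer.encode("Hello, do you like tea?")
print(encoding.tokens)  # the string pieces
print(encoding.ids)     # the corresponding token IDs
```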
-
Hi all, here's some new bonus material that I thought you might enjoy 😊
Byte Pair Encoding (BPE) Tokenizer From Scratch
Happy weekend!
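For anyone who wants a quick preview before diving into the notebook, below is a minimal sketch of the core BPE training loop, just to illustrate the general idea (it is not the notebook's exact implementation):

```python
from collections import Counter

def get_pair_counts(tokens):
    # Count how often each adjacent pair of token IDs occurs
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair, new_id):
    # Replace every occurrence of `pair` with the new token ID
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "the cat in the hat sat on the mat"
tokens = list(text.encode("utf-8"))  # start from raw bytes (IDs 0-255)

merges = {}
for new_id in range(256, 256 + 10):  # learn 10 merges
    counts = get_pair_counts(tokens)
    if not counts:
        break
    top_pair = max(counts, key=counts.get)  # most frequent adjacent pair
    tokens = merge_pair(tokens, top_pair, new_id)
    merges[top_pair] = new_id

print(merges)  # the learned merge rules
print(tokens)  # the text re-encoded with merged token IDs
```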