You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Apr 23, 2024. It is now read-only.
In the original BPE paper, as well as in the BPE dropout paper, the authors apply word-based tokenization (namely, the Moses tokenizer, as well as some others) before the main algorithm. However, this project's readme is somewhat vague regarding this detail. Do I understand it correctly that the only word-based tokenization implemented is basically splitting on spaces and that's it?
What confuses me is this quote: ours does not consider tokens that cross word boundaries. For some languages it's impossible not to consider tokens that cross word boundaries based on spaces alone. So my question as follows: is there a more sophisticated word-based tokenizer under the hood after all?
The text was updated successfully, but these errors were encountered:
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Hi,
In the original BPE paper, as well as in the BPE dropout paper, the authors apply word-based tokenization (namely, the Moses tokenizer, as well as some others) before the main algorithm. However, this project's readme is somewhat vague regarding this detail. Do I understand it correctly that the only word-based tokenization implemented is basically splitting on spaces and that's it?
What confuses me is this quote:
ours does not consider tokens that cross word boundaries
. For some languages it's impossible not to consider tokens that cross word boundaries based on spaces alone. So my question as follows: is there a more sophisticated word-based tokenizer under the hood after all?The text was updated successfully, but these errors were encountered: