This repository has been archived by the owner on Apr 23, 2024. It is now read-only.
I've trained a tokenizer with a 50k vocab on over 500M sentences. I'm now encoding many keywords that contain OOV terms, which the tokenizer does a poor job of tokenizing. I was wondering whether an option could be introduced to let users modify the vocab after the tokenizer is trained. I've seen the issue where the suggestion was to repeat these OOV terms some number of times (1000?) in the training data, so that the tokenizer can pick them up during training and add them to the vocab. But the problem is that there is no reliable way to know in advance which terms need to be included in the training data! Any thoughts on how to handle such situations?
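One pragmatic heuristic for the "which terms should go into the training data?" part of the question is to rank keywords by how badly the current vocab splits them (their "fertility", i.e. subword tokens per word), and treat the worst-split keywords as candidates to add. The sketch below is not tied to any particular tokenizer library; the greedy longest-match splitter is just a stand-in for whatever encode function the real trained tokenizer exposes, and all names in it are illustrative.

```python
# Heuristic sketch: rank keywords by token count under the current vocab.
# greedy_tokenize is a toy stand-in for the real tokenizer's encode().

def greedy_tokenize(word, vocab):
    """Greedy longest-match subword split; uncovered characters become '<unk>'."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocab piece matches at position i; emit <unk> for one char.
            tokens.append("<unk>")
            i += 1
    return tokens

def worst_split_keywords(keywords, vocab, top_k=10):
    """Return the top_k keywords that split into the most tokens (worst fertility)."""
    scored = [(len(greedy_tokenize(kw, vocab)), kw) for kw in keywords]
    scored.sort(reverse=True)
    return [kw for _, kw in scored[:top_k]]

# Toy example: 'abcabc' splits into 6 single-character pieces, so it is the
# strongest candidate for adding to the vocab or oversampling in training data.
vocab = {"token", "izer", "train", "ing", "a", "b", "c"}
keywords = ["tokenizer", "training", "abcabc"]
print(worst_split_keywords(keywords, vocab, top_k=2))  # → ['abcabc', 'training']
```

Running this ranking over the full keyword list against the trained tokenizer would give a determined, data-driven shortlist of OOV-heavy terms, instead of guessing which terms to repeat in the training corpus.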