Normally, when using a custom tokenizer with torchtext fields, you can pass the tokenizer function to the `Field` constructor and then build a `vocab` attribute, which keeps track of the `stoi` (string-to-index) mapping:
```python
TEXT = Field(sequential=True, tokenize=my_tokenizer_fn)
TEXT.build_vocab(train_data)  # builds the stoi/itos mapping
```
Since 🤗 tokenizers build their own vocab mappings, what's the best way to use them with torchtext, for example to use one of their datasets? If you just did the above, the `TEXT.vocab` mappings wouldn't match the tokenizer's mappings. Unfortunately, I haven't seen a simple way of using custom mappings in torchtext. The best solution I've found so far is to follow the above procedure and then manually override the `TEXT` vocab with the tokenizer's. So that would look something like this:
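A minimal sketch of that override. `tokenizer_vocab` below is a tiny stand-in for a real 🤗 tokenizer's token-to-id mapping (e.g. the dict returned by `BertTokenizer.from_pretrained(...).get_vocab()`); with torchtext and transformers installed you would plug in the real objects as shown in the comments.

```python
# Stand-in for a 🤗 tokenizer's vocab: a dict mapping token -> id.
tokenizer_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}

# Rebuild the index-to-string list by sorting tokens by their ids,
# so itos[i] gives the token whose id is i.
itos = [tok for tok, _ in sorted(tokenizer_vocab.items(), key=lambda kv: kv[1])]

# With the real libraries, the override would look like:
#   tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#   TEXT = Field(sequential=True, tokenize=tokenizer.tokenize)
#   TEXT.build_vocab(train_data)   # builds a mapping we then discard
#   TEXT.vocab.stoi = tokenizer.get_vocab()
#   TEXT.vocab.itos = [t for t, _ in sorted(TEXT.vocab.stoi.items(),
#                                           key=lambda kv: kv[1])]

print(itos)                             # ['[PAD]', '[UNK]', 'hello', 'world']
print(itos[tokenizer_vocab["world"]])   # 'world' -- the mapping round-trips
```

One caveat with this approach: `build_vocab` is still run only so that the `vocab` attribute exists before being clobbered, which feels wasteful and is part of why a cleaner hook would be nice.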
Is there a more straightforward way to do this? If not, it might be handy to have a helper function and/or example for others to reference since torchtext is so ubiquitous.