There is a question mark character in one of the Universal Dependencies datasets that gets wiped out by the tokenizer for the Italian BERT & ELECTRA models:
https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
warning: big file
https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-PoSTWITA/master/it_postwita-ud-train.conllu
search for "ewww" in the training file
It looks like this if I copy and paste it:
ewww — in viaggio Roma
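Since the character just shows up as a box when pasted, here is a rough sketch for locating it programmatically (it assumes the raw it_postwita-ud-train.conllu file linked above has been downloaded locally):

```python
import unicodedata

with open("it_postwita-ud-train.conllu", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        for ch in line:
            # private-use ("Co") and unassigned ("Cn") codepoints are the
            # usual suspects when a character renders as a box or question mark
            if unicodedata.category(ch) in ("Co", "Cn"):
                print(lineno, hex(ord(ch)), line.rstrip())
```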
According to Emacs `describe-char`, it is character 0xFE4FA.
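For context, 0xFE4FA sits in a Unicode private use area (category "Co"). If I'm reading the BERT tokenizer's text-cleaning step correctly, any character whose Unicode category starts with "C" (other than tab, newline, and carriage return) is treated as a control character and silently deleted before wordpiece splitting, which would explain the disappearance. A minimal check of that assumption:

```python
import unicodedata

ch = chr(0xFE4FA)  # the character emacs reports
print(unicodedata.category(ch))  # 'Co' -> private use

# Simplified version of the check the BERT basic tokenizer applies while
# cleaning text; matching characters are dropped outright, not replaced.
def stripped_by_bert_cleaning(char: str) -> bool:
    if char in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(char).startswith("C")

print(stripped_by_bert_cleaning(ch))  # True -> the character vanishes
```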
Anyway, hopefully that's enough background to figure out which character is causing the problem. If I run the following sentences through the tokenizer with `tokenizer.tokenize(sentence)`:

ewww 🐈 — in viaggio Roma # another random character
ewww — in viaggio Roma # to test, maybe need to check that this is the weird character, not just a box
ewww — in viaggio Roma

the word with the strange character simply disappears from the output. The missing word causes confusion for me when trying to correlate the BERT embeddings with the words they represent. Can the tokenizer be fixed to treat that character (or any other strange character) as `[UNK]` as well?
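In the meantime, one possible user-side workaround is to pre-replace the characters the cleaner would drop with the tokenizer's unk token, so they surface as [UNK] instead of vanishing. A sketch only; `dbmdz/bert-base-italian-cased` is my assumption for the checkpoint name, so substitute whichever Italian model applies:

```python
import unicodedata
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")

def shield_strange_chars(text: str, unk: str) -> str:
    """Replace characters the BERT cleaner silently drops (Unicode
    category C*, minus tab/newline/CR) with the unk token so they
    show up as [UNK] instead of disappearing."""
    pieces = []
    for ch in text:
        if ch not in ("\t", "\n", "\r") and unicodedata.category(ch).startswith("C"):
            pieces.append(f" {unk} ")
        else:
            pieces.append(ch)
    return "".join(pieces)

sentence = "ewww " + chr(0xFE4FA) + " — in viaggio Roma"
print(tokenizer.tokenize(sentence))  # the strange character's word is gone
print(tokenizer.tokenize(shield_strange_chars(sentence, tokenizer.unk_token)))  # keeps an [UNK]
```

That keeps the input words and the tokens aligned one-to-one, though a fix inside the tokenizer itself would obviously be cleaner.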