Tokenizer results in blank token for extended UTF-8 characters #48

Open
AngledLuffa opened this issue Feb 28, 2023 · 0 comments

One of the Universal Dependencies datasets contains a character, rendered as a question mark, that gets wiped out by the tokenizer for the Italian BERT and ELECTRA models:

https://github.com/UniversalDependencies/UD_Italian-PoSTWITA

warning: big file
https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-PoSTWITA/master/it_postwita-ud-train.conllu

search for "ewww" in the training file

It looks like this if I copy and paste it:

ewww 󾓺 — in viaggio Roma

According to Emacs describe-char, it is character 0xFE4FA (U+FE4FA, which lies in Supplementary Private Use Area-A).

Anyway, hopefully that's enough background to figure out which character is causing the problem. If I run the following sentences through the tokenizer with tokenizer.tokenize(sentence) I get the following:

ewww 🐈 — in viaggio Roma   # another random character
ewww 󾓺 — in viaggio Roma    # to test, maybe need to check that this is the weird character, not just a box
ewww — in viaggio Roma
# I printed the word pieces and their IDs
(['e', '##www', '[UNK]', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 101, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])

The missing token makes it hard to align the BERT embeddings with the words they represent. Can the tokenizer be fixed to treat that character (or any other strange character) as [UNK] as well?
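In case it helps narrow things down, here is a minimal sketch of how the character could be checked and worked around before tokenizing. It rests on an assumption that is not confirmed in this issue: that the tokenizer's text-cleaning step discards characters whose Unicode general category starts with "C" (control, format, private use, unassigned). The helper name `replace_unprintable` and the `"[UNK]"` placeholder are hypothetical; whether a literal `[UNK]` string in the input is mapped to the unknown token depends on the tokenizer, so a character already known to produce `[UNK]` could be passed instead.

```python
import unicodedata


def replace_unprintable(text, placeholder="[UNK]"):
    """Replace characters in Unicode categories C* with a placeholder.

    Tab, newline and carriage return are kept, since tokenizers normally
    treat them as whitespace. Both the function name and the default
    placeholder are illustrative assumptions, not part of any library.
    """
    out = []
    for ch in text:
        if ch in "\t\n\r":
            out.append(ch)
        elif unicodedata.category(ch).startswith("C"):
            out.append(placeholder)
        else:
            out.append(ch)
    return "".join(out)


weird = chr(0xFE4FA)                # the character reported by describe-char
print(unicodedata.category(weird))  # Co (private use)

sentence = f"ewww {weird} — in viaggio Roma"
print(replace_unprintable(sentence))
```

Running the cleaned sentence through tokenizer.tokenize(sentence) would then keep one token per original word, which is what the embedding alignment needs. The same check with unicodedata.category also confirms whether a pasted character really is the private-use one and not just a box.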

@stefan-it stefan-it self-assigned this Mar 15, 2023